ABSTRACT

We study the problem of group linkage: linking records that refer to
entities in the same group. Applications for group linkage include finding
businesses in the same chain, finding conference attendees from the same
affiliation, finding players from the same team, etc. Group linkage faces
challenges not present for traditional record linkage. First, although
different members in the same group can share some similar global values of an
attribute, they represent different entities so can also have distinct local
values for the same or different attributes, requiring a high tolerance for
value diversity. Second, groups can be huge (with tens of thousands of records),
requiring high scalability even after using good blocking strategies.

We present a two-stage algorithm: the first stage identifies cores
containing records that are very likely to belong to the same group, while
being robust to possible erroneous values; the second stage collects strong
evidence from the cores and leverages it for merging more records into the
same group, while being tolerant to differences in local values of an
attribute. Experimental results show the high effectiveness and efficiency of
our algorithm on various real-world data sets.

1. INTRODUCTION 

Record linkage aims at linking records that refer to the same real-world
entity and it has been extensively studied in the past years
(surveyed in [8, 19]). In this paper we study a related but different
problem that we call group linkage: linking records that refer to
entities in the same group.

One major motivation for our work comes from identifying business
chains: connected business entities that share a brand name
and provide similar products and services (e.g., Walmart, McDonald's).
With the advent of the Web and mobile devices, we are
observing a boom in local search; that is, searching local businesses
under geographical constraints. Local search engines include
Google Maps, Yahoo! Local, YellowPages, yelp, ezlocal, etc.
The knowledge of business chains can have a big economic value to
local search engines, as it allows users to search by business chain,
allows search engines to render the returned results by chains,
allows data collectors to clean and enrich information within the
same chain, allows the associated review system to connect reviews
on branches of the same chain, and allows sales people to target
potential customers. Business listings are rarely associated with
specific chains explicitly in real-world business-listing collections,
so we need to identify the chains. Sharing the same name, phone
number, or URL domain name can all serve as evidence of belonging
to the same chain. However, for US businesses alone there are
tens of thousands of chains and, as we show soon, we cannot easily
develop any rule set that applies to all chains.

Table 1: Identified top-5 US business chains. For each chain, we show
the number of stores, distinct business names, distinct phone numbers,
distinct URL domain names, and distinct categories.

Name                              #Store   #Name    #Phn   #URL   #Cat
SUBWAY                            21,912     772  21,483      6     23
Bank of America                   21,727      48   6,573    186     24
U-Haul                            21,638   2,340  18,384     14     20
USPS - United State Post Office   19,225  12,345   5,761    282     22
McDonald's                        17,289   2,401  16,607    568     47

We are also motivated by applications where we need to find
people from the same organization, such as counting conference
attendees from the same affiliation, counting papers by authors from
the same institution, and finding players of the same team. The
organization information is often missing, incomplete, or simply too
heterogeneous to be recognized as the same (e.g., “International
Business Machines Corporation”, “IBM Corp”, “IBM”, “IBM Research
Labs”, “IBM-Almaden”, etc., all refer to the same organization).
Contact phones, email addresses, and mailing addresses of
people all provide extra evidence for group linkage, but they can
also vary for different people even in the same organization.

Group linkage faces challenges not present for traditional record
linkage. First, although different members in the same group can
share some similar global values of an attribute, they represent
different entities so can also have distinct local values for the same
or different attributes. For example, different branches in the same
business chain can provide different local phone numbers, different
addresses, etc. It is non-trivial to distinguish such differences from
various representations for the same value and sometimes erroneous
values in the data. Second, there are often millions of records
for group linkage, and a group can contain tens of thousands of
members. A good blocking strategy should put these tens of thousands
of records in the same block; but performing record linkage
via traditional pairwise comparisons within such huge blocks can
be very expensive. Thus, scalability is a big challenge. We use the
following example of identifying business chains throughout the
paper for illustration.

Example 1.1. We consider a set of 18M real-world business
listings in the US extracted from Yellowpages.com, each describing
a business by its name, phone number, URL domain name, location,
and category. Our algorithm automatically finds 600K business
chains and 2.7M listings that belong to these chains. Table 1 lists
the largest five chains we found. We observe that (1) each chain
contains up to 22K different branch stores, (2) different branches
from the same chain can have a large variety of names, phone numbers,
and URL domain names, and (3) even chains of similar sizes
can have very different numbers of distinct URLs (same for other
attributes). Thus, rule-based linkage can hardly succeed and
scalability is a necessity.

Table 2: Real-world business listings. We show only state for location
and simplify names of category. There is a wrong value, marked with *.

RID   name                 phone  URL (domain)              location  category
r1    Home Depot, The      808                              NJ        furniture
r2    Home Depot, The      808                              NY        furniture
r3    Home Depot, The      808    homedepot                 MD        furniture
r4    Home Depot, The      808    homedepot                 AK        furniture
r5    Home Depot, The      808    homedepot                 MI        furniture
r6    Home Depot, The      101    homedepot                 IN        furniture
r7    Home Depot, The      102    homedepot                 NY        furniture
r8    Home Depot, USA      103    homedepot                 WV        furniture
r9    Home Depot USA       808                              SD        furniture
r10   Home Depot - Tools   808                              FL        furniture
r11   Taco Casa                   tacocasa                  AL        restaurant
r12   Taco Casa            900    tacocasa                  AL        restaurant
r13   Taco Casa            900    tacocasa, tacocasatexas*  AL        restaurant
r14   Taco Casa            900                              AL        restaurant
r15   Taco Casa            900                              AL        restaurant
r16   Taco Casa            701    tacocasatexas             TX        restaurant
r17   Taco Casa            702    tacocasatexas             TX        restaurant
r18   Taco Casa            703    tacocasatexas             TX        restaurant
r19   Taco Casa            704                              NY        food store
r20   Taco Casa                   tacodelmar                AK        restaurant

Table 2 shows a set of 20 business listings (with some abstraction)
in this data set. After investigating their webpages manually,
we find that r1-r18 belong to three business chains: Ch1 =
{r1-r10}, Ch2 = {r11-r15}, and Ch3 = {r16-r18}; r19 and
r20 do not belong to any chain. Note the slightly different names
for businesses in chain Ch1; also note that r13 is integrated from
different sources and contains two URLs, one (tacocasatexas) being
wrong.

Simple linkage rules do not work well on this data set. For example,
if we require only high similarity on name for chain identification,
we may wrongly decide that r11-r20 all belong to the
same chain as they share a popular restaurant name Taco Casa.
Traditional linkage strategies do not work well either. If we apply
Swoosh-style linkage [27] and iteratively merge records with high
similarity on name and shared phone or URL, we can wrongly
merge Ch2 and Ch3 because of the wrong URL from r13. If we
require high similarity between listings on name, phone, URL, and
category, we may either split r6-r8 out of chain Ch1 because
of their different local phone numbers, or learn a low weight for
phone but split r9-r10 out of chain Ch1 since sharing the same
phone number, the major evidence, is downweighted. □
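The failure of iterative merging in Example 1.1 can be reproduced on the Taco Casa listings of Table 2. The sketch below is only an illustration: it replaces Swoosh-style match-and-merge with a plain transitive closure over a hypothetical pairwise rule (same name, plus a shared phone or URL domain), and the names `RECORDS`, `match`, and `transitive_clusters` are assumptions made for this example.

```python
from itertools import combinations

# Taco Casa listings from Table 2: (name, phone or None, set of URL domains).
RECORDS = {
    "r11": ("Taco Casa", None, {"tacocasa"}),
    "r12": ("Taco Casa", "900", {"tacocasa"}),
    "r13": ("Taco Casa", "900", {"tacocasa", "tacocasatexas"}),  # 2nd URL is wrong
    "r14": ("Taco Casa", "900", set()),
    "r15": ("Taco Casa", "900", set()),
    "r16": ("Taco Casa", "701", {"tacocasatexas"}),
    "r17": ("Taco Casa", "702", {"tacocasatexas"}),
    "r18": ("Taco Casa", "703", {"tacocasatexas"}),
    "r19": ("Taco Casa", "704", set()),
    "r20": ("Taco Casa", None, {"tacodelmar"}),
}


def match(a, b):
    """Hypothetical pairwise rule: same name and a shared phone or URL domain."""
    name_a, phone_a, urls_a = a
    name_b, phone_b, urls_b = b
    same_phone = phone_a is not None and phone_a == phone_b
    return name_a == name_b and (same_phone or bool(urls_a & urls_b))


def transitive_clusters(records):
    """Merge all matching pairs transitively with a small union-find."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if match(records[a], records[b]):
            parent[find(a)] = find(b)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), set()).add(rid)
    # Largest cluster first, ties broken by member names.
    return sorted(clusters.values(), key=lambda c: (-len(c), sorted(c)))


clusters = transitive_clusters(RECORDS)
for cluster in clusters:
    print(sorted(cluster))
```

Under this rule r13 alone bridges the two chains: it shares phone 900 with r12, r14, and r15, and the erroneous domain tacocasatexas with r16-r18, so Ch2 and Ch3 collapse into one cluster while r19 and r20 stay apart.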

The key idea in our solution is to find strong evidence that can
glue group members together, while being tolerant to differences
in values specific for individual group members. For example, we
wish to reward sharing of primary values, such as primary phone
numbers or URL domain names for chain identification, but would
not penalize differences from local values, such as locations, local
phone numbers, and even categories. For this purpose, our algorithm
proceeds in two stages. First, we identify cores containing
records that are very likely to belong to the same group. Second,
we collect strong evidence from the resulting cores, such as primary
phone numbers and URL domain names in business chains, based
on which we cluster the cores and remaining records into groups.
The use of cores and strong evidence distinguishes our clustering
algorithm from traditional clustering techniques for record linkage.
In this process, it is crucial that core generation makes very few
false positives even in the presence of erroneous values, such that
we can avoid a ripple effect on clustering later. Our algorithm is
designed to ensure efficiency and scalability.

The group linkage problem we study in this paper is different
from the group linkage in [18, 23], which decides similarity between
pre-specified groups of records. Our goal is to find records
that belong to the same group, and we make three contributions.

1. We study core generation in the presence of erroneous data. Our
core is robust in the sense that even if we remove a few possibly
erroneous records from a core, we still have strong evidence
that the rest of the records in the core must belong to
the same group.

2. We then reduce the group linkage problem into clustering
cores and remaining records. Our clustering algorithm leverages
strong evidence collected from cores and meanwhile is
tolerant to value variety of records in the same group.

3. We conducted experiments on two real-world data sets in
different domains, showing high efficiency and effectiveness of
our algorithms.

Note that we assume that, prior to group linkage, we first conduct
record linkage (e.g., [13]). Our experiments show that minor mistakes
in record linkage do not significantly affect the results of
group linkage, and records that describe the same entity but fail to
be merged in the record-linkage step are often put into the same
group. We plan to study how to combine record linkage and group
linkage to improve the results of both in the future.

In the rest of the paper, Section 2 discusses related work.
Section 3 defines the problem and provides an overview of our solution.
Sections 4 and 5 describe the two stages in our solution. Section 6
describes experimental results. Section 7 concludes.

2. RELATED WORK 

Record linkage has been extensively studied in the past
(surveyed in [8, 19]). Traditional linkage techniques aim at linking
records that refer to the same real-world entity, so implicitly assume
value consistency between records that should be linked. Group
linkage is different in that it aims at linking records that refer to
different entities in the same group. The variety of individual entities
requires better use of strong evidence and tolerance on different
values even within the same group. These two features differentiate
our work from any previous linkage technique.

For record clustering in linkage, existing work may apply the
transitive rule [?] or do match-and-merge [27], or reduce it to an
optimization problem [16]. Our work is different in that our
core-identification algorithm aims at being robust to a few erroneous
records; and our clustering algorithm emphasizes leveraging the
strong evidence collected from the cores.

For record-similarity computation, existing work can be rule based
[17], classification based [11], or distance based [6]. There has also
been work on weight (or model) learning from labeled data
[?, 29]. Our work is different in that in addition to learning a weight
for each attribute, we also learn a weight for each value based on
whether it serves as important evidence for the group. Note that
some previous works are also tolerant to different values but leverage
evidence that may not be available in our contexts: [10] is tolerant
to schema heterogeneity from different relations by specifying
matching rules; [13] is tolerant to possibly false values by considering
agreement between different data providers; [?] is tolerant
to out-of-date values by considering time stamps; we are tolerant
to diversity within the same group.

Two-stage clustering has been proposed in the IR and machine
learning communities [1, 20, 22, 28, 30]; however, they identify
cores in different ways. Techniques in [20, 28] consider a core
as a single record, either randomly selected or selected according
to the weighted degrees of nodes in the graph. Techniques
in [30] generate cores using agglomerative clustering but can be
too conservative and miss strong evidence. Techniques in [1] identify
cores as bi-connected components, where removing any node
would not disconnect the graph. Although this corresponds to the
1-robustness requirement in our solution (defined in Section 4),
they generate overlapping clusters; it is not obvious how to derive
non-overlapping clusters in applications such as business-chain
identification and how to extend their techniques to guarantee
k-robustness. Finally, techniques in [20, 22] require knowledge of
the number of clusters for one of the stages, so do not directly apply
in our context. We compare with these methods whenever applicable
in experiments (Section 6), showing that our algorithm is
robust in the presence of erroneous values and consistently generates
high-accuracy results on data sets with different features.

Finally, we distinguish our work from the group linkage in [18, 23],
which has different goals. On et al. [23] decided similarity
between pre-specified groups of records and the group-entity
relationship is many-to-many (e.g., authors and papers). Huang [18]
decided whether two pre-specified groups of records from different
data sources refer to the same group by analysis of social network.
Our goal is to find records that belong to the same group.

3. OVERVIEW 

This section formally defines the group linkage problem and
provides an overview of our solution.

3.1 Problem definition 

Let R be a set of records that describe real-world entities by a set
of attributes. For each record r ∈ R and each attribute A in this set,
we denote by r.A the value of r on A. Sometimes a record may contain
erroneous or missing values.

We consider the group linkage problem; that is, finding records
that represent entities belonging to the same real-world group. As
an example application, we wish to find business chains: sets of
business entities with the same or highly similar names that provide
similar products and services (e.g., Walmart, Home Depot, Subway,
and McDonald's).¹ We focus on non-overlapping groups, which
often hold in applications.

Definition 3.1 (Group linkage). Given a set R of records,
group linkage identifies a set of clusters CH of records in R, such
that (1) records that represent real-world entities in the same group
belong to one cluster, and (2) records from different groups belong
to different clusters. □

Example 3.2. Consider records in Example 1.1, where each
record describes a business store (at a distinct location) by
attributes name, phone, URL, location, and category.

The ideal solution to the group linkage problem contains 5 clusters:
Ch1 = {r1-r10}, Ch2 = {r11-r15}, Ch3 = {r16-r18},
Ch4 = {r19}, and Ch5 = {r20}. Among them, Ch2 and Ch3
represent two different chains with the same name. □

3.2 Overview of our solution 

Group linkage is related to but different from traditional record
linkage because it essentially looks for records that represent entities
in the same group, rather than records that represent exactly the
same entity. Different members in the same group often share a certain
amount of commonality (e.g., common name, primary phone,
and URL domain of chain stores), but meanwhile can also have a
lot of differences (e.g., different addresses, local phone numbers,
and local URL domains); thus, we need to allow much higher variety
in some attribute values to avoid false negatives. On the other
hand, as we have shown in Example 1.1, simply lowering our
requirement on similarity of records or similarity of a few attributes
in clustering can lead to a lot of false positives.

¹ http://en.wikipedia.org/wiki/Chain_store

The key intuition of our solution is to distinguish between strong
evidence and weak evidence. For example, different branches in the
same business chain often share the same URL domain name and
those in North America often share the same 1-800 phone number.
Thus, a URL domain or phone number shared among many business
listings with highly similar names can serve as strong evidence
for chain identification. In contrast, a phone number shared by only
a couple of business entities is much weaker evidence, since one
might be an erroneous or out-of-date value.

To facilitate leveraging strong evidence, our solution consists of
two stages. The first stage collects records that are highly likely to
belong to the same group; for example, a set of business listings
with the same name and phone number are very likely to be in the
same chain. We call the results cores of the groups; from them we
can collect strong evidence such as name, primary phone number,
and primary URL domain of chains. The key goal of this stage is to
be robust against erroneous values and make as few false positives
as possible, so we can avoid identifying strong evidence wrongly
and causing an incorrect ripple effect later; however, we need to keep
in mind that being too strict can miss important strong evidence.

The second stage clusters cores and remaining records into groups
according to the discovered strong evidence. It decides whether
several cores belong to the same group, and whether a record that
does not belong to any core actually belongs to some group. It also
employs weak evidence, but treats it differently from strong evidence.
The key intuition of this stage is to leverage the strong evidence
and meanwhile be tolerant to diversity of values in the same
group, so we can reduce false negatives made in the first stage.

We next illustrate our approach for business-chain identification. 


Example 3.3. Continue with the motivating example. In the
first stage we generate three cores: Cr1 = {r1-r7}, Cr2 =
{r14, r15}, Cr3 = {r16-r18}. Records r1-r7 are in the same
core because they have the same name, five of them (r1-r5) share
the same phone number 808 and five of them (r3-r7) share the
same URL homedepot. The other two cores are formed similarly. Note that
r13 does not belong to any core, because one of its URLs is the
same as that of r11-r12, and one is the same as that of r16-r18,
but except name, there is no other common information between
these two groups of records. To avoid mistakes, we defer the decision
on r13. Indeed, recall that tacocasatexas is a wrong value for
r13. For a similar reason, we defer the decision on r12.

In the second stage, we generate groups, that is, business chains. We
merge r8-r10 with core Cr1, because they have similar names
and share either the primary phone number or the primary URL.
We also merge r11-r13 with core Cr2, because (1) r12-r13 share
the primary phone 900 with Cr2, and (2) r11 shares the primary
URL tacocasa with r12-r13. We do not merge Cr2 and Cr3 though,
because they share neither the primary phone nor the primary URL.
We do not merge r19 or r20 to any core, because there is again not
much strong evidence. We thus obtain the ideal result. □

To facilitate this two-stage solution, we find attributes that provide
evidence for group identification and classify them into three
categories.

• Common-value attribute: We call an attribute A a common-value
attribute if all entities in the same group have the same
or highly similar A-values. Such attributes include business-name
for chain identification and organization for organization
linkage.

• Dominant-value attribute: We call an attribute A a dominant-value
attribute if entities in the same group often share one or
a few primary A-values (but there can also exist other less-common
values), and these values are seldom used by entities
outside the group. Such attributes include phone and
URL-domain for chain identification, and office-address,
phone-prefix, and email-server for organization linkage.

• Multi-value attribute: We call the rest of the attributes multi-value
attributes as there is often a many-to-many relationship
between groups and values of these attributes. Such attributes
include category for chain identification.

The classification can be either learned from training data based 
on cardinality of attribute values, or performed by domain experts 
since there are typically only a few such attributes. 

We describe core identification in Section 4 and group linkage
in Section 5. Our algorithms require common-value and dominant-value
attributes, which typically exist for groups in practice. While
we present the algorithms for the setting of one machine, many
components of our algorithms can be easily parallelized in Hadoop
infrastructure [4, 24]; it is not the focus of the paper and we briefly
describe the opportunities in Section 6.4.

4. CORE IDENTIFICATION 

The first stage of our solution creates cores consisting of records
that are very likely to belong to the same group. The key goal
in core identification is to be robust to possible erroneous values.
This section starts with presenting the criteria we wish the cores
to meet (Section 4.1), then describes how we efficiently construct
similarity graphs to facilitate core finding (Section 4.2), and finally
gives the algorithm for core identification (Section 4.3). Note that
the notations in this section can be slightly different from those in
Graph Theory.

4.1 Criteria for a core 

At the first stage we wish to make only decisions that are highly
likely to be correct; thus, we require that each core contains only
highly similar records, and different cores are fairly different and
easily distinguishable from each other. In addition, we wish that
our results are robust even in the presence of a few erroneous values
in the data. In the motivating example, r1-r7 form a good core,
because 808 and homedepot are very popular values among these
records. In contrast, r13-r18 do not form a good core, because
records r14-r15 and r16-r18 do not share any phone number
or URL domain; the only “connector” between them is r13, so they
can be wrongly merged if r13 contains erroneous values. Also,
considering r13-r15 and r16-r18 as two different cores is risky,
because (1) it is not very clear whether r13 is in the same chain as
r14-r15 or as r16-r18, and (2) these two cores share one URL
domain name so are not fully distinguishable.

We capture this intuition with connectivity of a similarity graph.
We define the similarity graph of a set R of records as an undirected
graph, where each node represents a record in R, and an
edge indicates high similarity between the connected records (we
describe later what we mean by high similarity). Figure 1 shows
the similarity graph for the motivating example.

Figure 1: Similarity graph for records in Table 2.

Each core would correspond to a connected sub-graph of the similarity
graph. We wish such a sub-graph to be robust such that
even if we remove a few nodes the sub-graph is still connected;
in other words, even if there are some erroneous records, without
them we still have enough evidence showing that the rest of the
records should belong to the same group. The formal definition
goes as follows.

Definition 4.1 (k-robustness). A graph G is k-robust if
after removing arbitrary k nodes and edges to these nodes, G is
still connected. A clique or a single node is k-robust for any k. □

In Figure 1, the subgraph with nodes r1-r7 is 2-robust. That
with r11-r18 is not 1-robust, as removing r13 can disconnect it.
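Definition 4.1 can be checked directly on small graphs by brute force. The sketch below is illustrative only: it assumes the edges of Figure 1 are exactly those induced by the five v-cliques listed in Table 3, it reads the definition as requiring connectivity after removing any set of at most k nodes, and the names `is_connected` and `is_k_robust` are hypothetical. Its cost is exponential in k, so it is not meant for large blocks.

```python
from itertools import combinations

# v-cliques of Figure 1, as listed in Table 3; each induces a clique of edges.
V_CLIQUES = {
    "C1": ["r1", "r2", "r3", "r4", "r5"],
    "C2": ["r3", "r4", "r5", "r6", "r7"],
    "C3": ["r11", "r12", "r13"],
    "C4": ["r12", "r13", "r14", "r15"],
    "C5": ["r13", "r16", "r17", "r18"],
}
EDGES = {pair for members in V_CLIQUES.values() for pair in combinations(members, 2)}


def is_connected(nodes, edges):
    """Depth-first search restricted to `nodes`; empty and 1-node graphs count."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    adjacent = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adjacent[u].add(v)
            adjacent[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacent[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


def is_k_robust(nodes, edges, k):
    """True if the induced subgraph stays connected after removing <= k nodes."""
    nodes = set(nodes)
    for size in range(min(k, len(nodes)) + 1):
        for removed in combinations(sorted(nodes), size):
            if not is_connected(nodes - set(removed), edges):
                return False
    return True


home_depot = [f"r{i}" for i in range(1, 8)]    # r1-r7
taco_casa = [f"r{i}" for i in range(11, 19)]   # r11-r18
print(is_k_robust(home_depot, EDGES, 2))
print(is_k_robust(taco_casa, EDGES, 1))
```

With these edges the check agrees with the text: r1-r7 survive the removal of any two nodes because at least one of r3, r4, r5 remains as a connector, whereas removing r13 cuts r16-r18 off from the rest of r11-r18.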

According to the definition, we can partition the similarity graph
into a set of k-robust subgraphs. As we do not wish to split any
core unnecessarily, we require the maximal k-robust partitioning:

Definition 4.2 (Maximal k-robust partitioning). Let
G be a similarity graph. A partitioning of G is a maximal k-robust
partitioning if it satisfies the following properties.

1. Each node belongs to one and only one partition.

2. Each partition is k-robust.

3. The result of merging any partitions is not k-robust. □

Note that a data set can have more than one maximal k-robust
partitioning. Consider r11-r18 in Figure 1. There are three maximal
1-robust partitionings: {{r11}, {r12, r14-r15}, {r13, r16-r18}};
{{r11-r12}, {r14-r15}, {r13, r16-r18}}; and
{{r11-r15}, {r16-r18}}. If we treat each partitioning as a possible world,
records that belong to the same partition in all possible worlds
have high probability to belong to the same group and so form a
core. Accordingly, we define a core as follows and can prove its
k-robustness.

Definition 4.3 (k-core). Let R be a set of records and G
be the similarity graph of R. The records that belong to the same
subgraph in every maximal k-robust partitioning of G form a
k-core of R. A core contains at least 2 records. □

Property 4.4. A k-core is k-robust. □

Proof. If a k-core Cr of G is not k-robust, there exists a maximal
k-robust partitioning of G where two nodes r and r' in Cr are
in different partitions (proved by Lemma 4.19).
This conflicts with the fact that records in Cr belong to the same
partition in every maximal k-robust partitioning of G. Therefore, a
k-core is k-robust. □

Example 4.5. Consider Figure 1 and assume k = 1. There
are two connected sub-graphs. For records r1-r7, the subgraph
is 1-robust, so they form a 1-core. For records r11-r18, there are
three maximal 1-robust partitionings for the subgraph, as we have
shown. Two subsets of records belong to the same subgraph in each
partitioning: {r14-r15} and {r16-r18}; they form two 1-cores. □
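Definition 4.3 can be illustrated on the three maximal 1-robust partitionings of r11-r18 given above. The sketch below takes those partitionings as given input and only intersects them; it is not the core-identification algorithm itself, and the function name `cores_from_partitionings` is an assumption made for this example. It assumes every record appears exactly once in every partitioning.

```python
# The three maximal 1-robust partitionings of r11-r18 listed in the text.
PARTITIONINGS = [
    [{"r11"}, {"r12", "r14", "r15"}, {"r13", "r16", "r17", "r18"}],
    [{"r11", "r12"}, {"r14", "r15"}, {"r13", "r16", "r17", "r18"}],
    [{"r11", "r12", "r13", "r14", "r15"}, {"r16", "r17", "r18"}],
]


def cores_from_partitionings(partitionings):
    """Return groups of >= 2 records that share a partition in every partitioning."""
    # signature[r] lists, per partitioning, the index of the partition holding r.
    signature = {}
    for partitioning in partitionings:
        for part_id, part in enumerate(partitioning):
            for rid in part:
                signature.setdefault(rid, []).append(part_id)

    # Records with identical signatures are never separated.
    groups = {}
    for rid, sig in signature.items():
        groups.setdefault(tuple(sig), set()).add(rid)

    return sorted((g for g in groups.values() if len(g) >= 2), key=sorted)


cores = cores_from_partitionings(PARTITIONINGS)
for core in cores:
    print(sorted(core))
```

Records r11, r12, and r13 each end up alone, since some partitioning separates them from every other record, which matches the two 1-cores of Example 4.5.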










Table 3: Simplified inverted index for the similarity graph in
Figure 1.

Record       V-Cliques     Represent
r1/2         C1            r1-r2
r3           C1, C2        r3
r4           C1, C2        r4
r5           C1, C2        r5
r6/7         C2            r6-r7
r11          C3            r11
r12          C3, C4        r12
r13          C3, C4, C5    r13
r14/15       C4            r14-r15
r16/17/18    C5            r16-r18


4.2 Constructing similarity graphs 

Generating the cores requires analysis on the similarity graph.
Even after blocking, a block can contain tens of thousands of records,
so it is not scalable to compare every pair of records in the same
block and create edges accordingly. We next describe how we
construct and represent the similarity graph in a scalable way.

We add an edge between two records if they have the same value
for each common-value attribute and share at least one value on
a dominant-value attribute;² our experiments show advantages of
this method over other edge-adding strategies (Section 6.2.1). All
records that share values on the common-value attributes and share
the same value on a dominant-value attribute form a clique, which
we call a v-clique. We can thus represent the graph with a set of
v-cliques, denoted by C; for example, the graph in Figure 1 can be
represented by 5 v-cliques (C1-C5). In addition, we maintain an
inverted index L, where each entry corresponds to a record r and
contains the v-cliques that r belongs to. Whereas the size of the
similarity graph can be quadratic in the number of the nodes, the
size of the inverted index is only linear in that number. The inverted
index also makes it easy to find adjacent v-cliques (i.e., v-cliques
that share nodes), as they appear in the same entry.
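The v-clique representation and its inverted index can be sketched for records r1-r8 of Table 2. This is a simplified illustration rather than the GraphConstruction algorithm: it assumes exact equality on the common-value attribute (name), treats phone and URL domain as the dominant-value attributes, and prunes sub-cliques with a naive pairwise check, which is affordable only on a toy input. The names `RECORDS` and `build_v_cliques` are hypothetical.

```python
from collections import defaultdict

# Records r1-r8 of Table 2 as (name, phone, list of URL domains).
RECORDS = {
    "r1": ("Home Depot, The", "808", []),
    "r2": ("Home Depot, The", "808", []),
    "r3": ("Home Depot, The", "808", ["homedepot"]),
    "r4": ("Home Depot, The", "808", ["homedepot"]),
    "r5": ("Home Depot, The", "808", ["homedepot"]),
    "r6": ("Home Depot, The", "101", ["homedepot"]),
    "r7": ("Home Depot, The", "102", ["homedepot"]),
    "r8": ("Home Depot, USA", "103", ["homedepot"]),
}


def build_v_cliques(records):
    """Group records by (name, dominant attribute, value); index them by record."""
    groups = defaultdict(set)
    for rid, (name, phone, urls) in records.items():
        if phone:
            groups[(name, "phone", phone)].add(rid)
        for url in urls:
            groups[(name, "url", url)].add(rid)

    # Keep groups of at least two records; identical groups collapse in the set.
    candidates = {frozenset(g) for g in groups.values() if len(g) > 1}
    # Naive quadratic pruning of proper sub-cliques (toy inputs only).
    maximal = sorted(
        (c for c in candidates if not any(c < other for other in candidates)),
        key=sorted,
    )

    # Inverted index: record -> labels of the v-cliques it belongs to.
    inverted = defaultdict(list)
    for idx, clique in enumerate(maximal, start=1):
        for rid in clique:
            inverted[rid].append(f"C{idx}")
    return maximal, dict(inverted)


v_cliques, inverted_index = build_v_cliques(RECORDS)
for rid in sorted(inverted_index, key=lambda r: int(r[1:])):
    print(rid, inverted_index[rid])
```

On this input the sketch yields the two v-cliques {r1-r5} (phone 808) and {r3-r7} (URL homedepot); r8 gets no entry because its name differs, so it shares no edge with the other records. The index holds one short list per record, which is the linear-size property the paragraph relies on.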

Graph construction is then reduced to v-clique finding, which
can be done by scanning values of dominant-value attributes. In
this process, we wish to prune a v-clique if it is a sub-clique of
another one. Pruning by checking every pair of v-cliques can be
very expensive since the number of v-cliques is also huge. Instead,
we do it together with v-clique finding. Specifically, our algorithm
GraphConstruction takes R as input and outputs C and L.
We start with C = L = ∅. For each value v of a dominant-value
attribute, we denote the set of records with v by R_v and do the
following.

1. Initialize the v-cliques for v as C_v = ∅. Add a single-record
cluster for each record r ∈ R_v to a working set T. Mark
each cluster as “unchanged”.

2. For each r ∈ R_v, scan L and consider each v-clique C ∈
L(r) that has not been considered yet. For all records in
C ∩ R_v, merge their clusters. Mark the merged cluster as
“changed” if the result is not a proper sub-clique of C. If
C ⊆ R_v, remove C from the v-clique set C. This step removes
the v-cliques that must be sub-cliques of those we will form next.

3. For each cluster C ∈ T, if there exists C' ∈ C_v such that
C and C' share the same value for each common-value attribute,
remove C and C' from T and C_v respectively, add
C ∪ C' to T and mark it as “changed”; otherwise, move C to
C_v. This step merges clusters that share values on common-value
attributes. At the end, C_v contains the v-cliques with
value v.

4. Add each v-clique with mark “changed” in C_v to C and update
L accordingly. The marking prunes size-1 v-cliques and
the sub-cliques of those already in C.

² In practice, we require only highly similar values for common-value
attributes and apply the transitive rule on similarity (i.e., if v1 and v2 are
highly similar, and so are v2 and v3, we consider v1 and v3 highly similar).

PROPOSITION 4.6. Let R be a set of records. Denote by n(r)
the number of values on dominant-value attributes from r ∈ R.
Let n = Σ_{r∈R} n(r) and m = max_{r∈R} n(r). Let s be the maximum
v-clique size. Algorithm GraphConstruction (1) runs in
time O(ns(m + s)), (2) requires space O(n), and (3) its result is
independent of the order in which we consider the records. □

PROOF. We first prove that GRAPHCONSTRUCTION runs in time
$O(ns(m + s))$. Step 2 of the algorithm takes time $O(nsm)$:
it takes time $O(ns)$ to scan all records for a dominant-value
attribute, and a record can be scanned at most $m$ times.
Step 3 takes time $O(ns^2)$. Thus, the algorithm runs in time
$O(nsm + ns^2) = O(ns(m + s))$.

We next prove that GRAPHCONSTRUCTION requires space $O(n)$.
For each value $v$ of a dominant-value attribute, the algorithm keeps
three data sets: $L$, which takes space $O(n)$, and $\bar{C}_v$ and
$\bar{T}$, which together require space no greater than $O(|R|)$.
Since $n \ge |R|$, the algorithm requires space $O(n)$.

We now prove that the result of GRAPHCONSTRUCTION is order
independent. Given $L$ and $R_v$, Step 2 scans $L$ and applies the
transitive rule to merge clusters of records in $C \cap R_v$, for each
v-clique $C \in L$. The process is independent of the order in which we
consider the records in $R_v$. The order independence of the result in
Step 3 is proven in [2]. Therefore, the final result is independent of
the order in which we consider the records. □

Example 4.7. Consider graph construction for records in
Table [?]. Figure 1 shows the similarity graph and Table [?](a) shows
the inverted list. We focus on records $r_1 - r_8$ for illustration.

First, $r_1 - r_5$ share the same name and phone number 808, so we
add v-clique $C_1 = \{r_1 - r_5\}$ to $\bar{C}$. Now consider URL
homedepot, where $R_v = \{r_3 - r_8\}$. Step 1 generates 6 clusters, each
marked “unchanged”, and $\bar{T} = \{\{r_3\}, \ldots, \{r_8\}\}$. Step 2
looks up $L$ for each record in $R_v$. Among them, $r_3 - r_5$ belong to
v-clique $C_1$, so it merges their clusters and marks the result
$\{r_3 - r_5\}$ “unchanged” ($\{r_3 - r_5\} \subset C_1$); then,
$\bar{T} = \{\{r_3 - r_5\}, \{r_6\}, \{r_7\}, \{r_8\}\}$.
Step 3 compares these clusters and merges the first three as they
share the same name, marking the result as “changed”. At the end,
$\bar{C}_v = \{\{r_3 - r_7\}, \{r_8\}\}$. Finally, Step 4 adds
$\{r_3 - r_7\}$ to $\bar{C}$ and discards $\{r_8\}$ since it is marked
“unchanged”. □

Given the sheer number of records in $R$, the inverted index can
still be huge. In fact, according to the following theorem, records
in the same v-clique but not any other v-clique must belong to the
same core, so we do not need to distinguish them. Thus, we simplify
the inverted index such that for each v-clique we keep only
a representative for nodes belonging only to this v-clique. Table 3
shows the simplified index for the similarity graph in Figure 1.
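
The simplification can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the inverted index is modeled as a dictionary from record id to the set of v-clique ids it belongs to, and the names `simplify_index`, `inverted`, and the toy ids are hypothetical, not taken from the paper's implementation.

```python
from collections import defaultdict


def simplify_index(inverted):
    """Collapse records that belong to exactly one v-clique.

    inverted: dict mapping record id -> iterable of v-clique ids
              (an assumed in-memory form of the inverted index).
    Returns a dict mapping a tuple of record ids (one merged node)
    to the frozenset of v-cliques that node belongs to.
    """
    simplified = {}
    only_in = defaultdict(list)  # v-clique id -> records seen only there
    for record in sorted(inverted):
        cliques = frozenset(inverted[record])
        if len(cliques) == 1:
            (clique_id,) = cliques
            only_in[clique_id].append(record)
        else:
            # Records shared by several v-cliques stay as separate nodes.
            simplified[(record,)] = cliques
    for clique_id, records in only_in.items():
        # One representative node per v-clique for its exclusive records.
        simplified[tuple(records)] = frozenset([clique_id])
    return simplified


# Toy index (hypothetical): two v-cliques that overlap on r3, r4, r5.
toy = {
    "r1": {"C1"}, "r2": {"C1"},
    "r3": {"C1", "C2"}, "r4": {"C1", "C2"}, "r5": {"C1", "C2"},
    "r6": {"C2"}, "r7": {"C2"},
}
nodes = simplify_index(toy)
```

In the toy index, `r1` and `r2` collapse into one node and so do `r6` and `r7`, while the three shared records remain distinct, so each v-clique ends up with four nodes.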

Theorem 4.8. Let G be a similarity graph and G' be a graph 
derived from G by merging nodes that belong to only one and the 
same v-clique. Two nodes belong to the same core of G' if and only 
if they belong to the same core of G. □ 

PROOF. We need to prove that (1) if two nodes $r$ and $r'$ belong
to the same core in $G'$, they are in the same core of $G$, and (2) if
two nodes $r$ and $r'$ belong to the same core of $G$, they are in the
same core of $G'$.

We first prove that if two nodes $r$ and $r'$ belong to the same
core in $G'$, they are in the same core of $G$. Suppose there does
not exist any core in $G$ that contains both $r$ and $r'$. It means that
there exists a maximal $k$-robust partitioning in $G$ where $r$ and $r'$
are in different partitions. Let $P$ be such a partitioning of $G$ and
consider the partitioning $P'$ of $G'$ where each pair of nodes in
the same partition $C$ of $P$ are in the same partition $C'$ of $P'$ and
vice versa. We prove that $P'$ is a maximal $k$-robust partitioning
in $G'$. (1) It is obvious that each node in $P'$ belongs to one and
only one partition. (2) For each partition $C'$ in $P'$, removing any
$k$ nodes in $C'$ is equivalent to removing $n + m$ nodes in $G$, where
$n$ nodes belong to more than one v-clique in $C$, $m$ nodes belong
to single v-cliques in $C$, and $n \le k$. Since removing $m$ nodes
that belong to single v-cliques does not disconnect $C$ and we know
$n \le k$, removing the $n + m$ nodes does not disconnect $C$. This in
turn proves that removing $k$ nodes in $C'$ does not disconnect $C'$, so
$C'$ is $k$-robust. (3) Similarly, the result of merging
any partitions in $P'$ is not $k$-robust. Therefore, $P'$ is a maximal
$k$-robust partitioning in $G'$. Given that $r$ and $r'$ are in different
partitions of $P'$, there does not exist a core of $G'$ that contains both
$r$ and $r'$. This contradicts the fact that $r$ and $r'$ belong to the
same core in $G'$, and thus proves that $r$ and $r'$ are in the same
core of $G$.

We next prove that if two nodes $r$ and $r'$ belong to the same
core of $G$, they are in the same core of $G'$. Suppose there does not
exist any core in $G'$ that contains both $r$ and $r'$. It means that there
exists a maximal $k$-robust partitioning in $G'$ where $r$ and $r'$ are
in different partitions. Let $P'$ be such a partitioning of $G'$ and
consider the partitioning $P$ of $G$ where each pair of nodes in the same
partition $C'$ of $P'$ are in the same partition $C$ of $P$ and vice versa.
In a similar way as above, we have that $P$ is a maximal $k$-robust
partitioning in $G$. Given that $r$ and $r'$ are in different partitions of
$P$, there does not exist a core of $G$ that contains both $r$ and $r'$. This
contradicts the fact that $r$ and $r'$ belong to the same core in $G$,
and thus proves that $r$ and $r'$ are in the same core of $G'$. □

Case study: On a data set with 18M records (described in
Section [?]), our graph-construction algorithm finished in 1.9 hours. The
original similarity graph contains 18M nodes and 4.2B edges. The
inverted index is of size 89MB, containing 3.8M entries, each
associated with at most 8 v-cliques; in total there are 1.2M v-cliques.
The simplified inverted index is of size 34MB, containing 1.5M
entries, where an entry can represent up to 11K records. Therefore,
the simplified inverted index reduces the size of the similarity graph
by 3 orders of magnitude.

4.3 Identifying cores 

We solve the core-identification problem by reducing it to a
Max-flow/Min-cut Problem. However, computing the max flow for a
given graph $G$ and a source-destination pair takes time $O(|G|^{2.5})$,
where $|G|$ denotes the number of nodes in $G$; even the simplified
inverted index can still contain millions of entries, so it can be very
expensive. We thus first merge certain v-cliques according to a
sufficient (but not necessary) condition for $k$-robustness and consider
them as a whole in core identification; we then split the graph into
subgraphs according to a necessary (but not sufficient) condition
for $k$-robustness. We apply reduction only on the resulting
subgraphs, which are substantially smaller as we show at the end of
this section. Section 4.3.1 describes screening before reduction,
Section 4.3.2 describes the reduction, and Section 4.3.3 gives the
full algorithm, which iteratively applies screening and the reduction.

4.3.1 Screening 

A graph can be considered as a union of v-cliques, so essentially
we need to decide if a union of v-cliques is $k$-robust. First, we can
prove the following sufficient condition for $k$-robustness.

Figure 2: Two example graphs.

Theorem 4.9 ($(k + 1)$-connected condition). Let $G$ be
a graph consisting of a union $Q$ of v-cliques. If for every pair of
v-cliques $C, C' \in Q$, there is a path of v-cliques between $C$ and
$C'$ and every pair of adjacent v-cliques on the path share at least
$k + 1$ nodes, graph $G$ is $k$-robust. □

PROOF. Given Menger’s Theorem [3], graph $G$ is $k$-robust if for
any pair of nodes $r, r'$ in $G$, there exist at least $k + 1$ independent
paths that do not share any nodes other than $r, r'$ in $G$. We now
prove that for any pair of nodes $r, r'$ in a graph $G$ that satisfies the
$(k + 1)$-connected condition, there exist at least $k + 1$ independent
paths between $r$ and $r'$. We consider two cases: 1) $r, r'$ are adjacent,
such that there exists a v-clique in $G$ that contains $r, r'$; 2) $r, r'$
are not adjacent, such that there exists no v-clique in $G$ that contains
$r, r'$.

We first consider Case 1, where there exists a v-clique $C$
containing $r, r'$. Since each v-clique in $G$ has more than $k + 1$ nodes,
there exist at least $k$ 2-length paths and one 1-length path between
$r, r' \in C$. It proves that there exist at least $k + 1$ independent
paths between $r$ and $r'$.

We next consider Case 2, where there exists no v-clique
containing $r, r'$ in $G$. Suppose $r \in C$, $r' \in C'$, where $C, C'$ are
different v-cliques in $G$. Since there exists a path of v-cliques
between $C$ and $C'$ where every pair of adjacent v-cliques on the path
share at least $k + 1$ nodes, there exist at least $k + 1$ independent
paths between $r$ and $r'$.

Given the above two cases, there exist at least $k + 1$
independent paths between every pair of nodes in $G$; therefore $G$ is
$k$-robust. □

We call a single v-clique or a union of v-cliques that satisfy the
$(k+1)$-connected condition a $(k+1)$-connected v-union. A
$(k+1)$-connected v-union must be $k$-robust but not vice versa. In
Figure 1, subgraph $\{r_1 - r_7\}$ is a 3-connected v-union, because the
only two v-cliques, $C_1$ and $C_2$, share 3 nodes. Indeed, it is 2-robust.
On the other hand, graph $G_1$ in Figure 2 is 2-robust but not 3-connected
(there are 4 v-cliques, where each pair of adjacent v-cliques share
only 1 or 2 nodes). Accordingly, we can consider a v-union as a
whole in core identification.
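
To make the gap between the sufficient condition and $k$-robustness itself concrete, the following brute-force check may help. It is a sketch under two assumptions of ours: $k$-robust is read as "the graph stays connected after removing any $k$ or fewer nodes" (the reading used in the proofs above), and the two v-cliques of the toy input are taken to be $\{r_1 - r_5\}$ and $\{r_3 - r_7\}$, as Example 4.7 suggests. The function names are hypothetical, and the exhaustive search is only feasible on tiny graphs.

```python
from itertools import combinations


def union_of_cliques(cliques):
    """Build an adjacency map for a graph given as a union of cliques."""
    adj = {}
    for clique in cliques:
        for u in clique:
            adj.setdefault(u, set()).update(v for v in clique if v != u)
    return adj


def is_connected(adj, removed):
    """Depth-first search over the nodes that are not removed."""
    nodes = [n for n in adj if n not in removed]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def is_k_robust(adj, k):
    """Exhaustively try every set of at most k removed nodes."""
    return all(
        is_connected(adj, set(subset))
        for size in range(k + 1)
        for subset in combinations(adj, size)
    )


# Two cliques sharing r3, r4, r5: a 3-connected v-union.
toy = union_of_cliques([
    ["r1", "r2", "r3", "r4", "r5"],
    ["r3", "r4", "r5", "r6", "r7"],
])
```

On this input, `is_k_robust(toy, 2)` holds, while `is_k_robust(toy, 3)` fails because removing the three shared nodes separates the two cliques.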

Next, we present a necessary condition for $k$-robustness.

Theorem 4.10 ($(k + 1)$-overlap condition). Graph $G$
is $k$-robust only if for every $(k + 1)$-connected v-union $Q \in G$, $Q$
shares at least $k + 1$ common nodes with the subgraph consisting
of the rest of the v-unions. □

PROOF. We prove that if graph $G$ contains a $(k + 1)$-connected
v-union $Q$ that shares at most $k$ common nodes with the rest of the
graph, $G$ is not $k$-robust. Since $Q$ shares at most $k$ common nodes
with the subgraph consisting of the rest of the v-unions, removing
the common nodes disconnects $Q$ from $G$, which proves that $G$ is
not $k$-robust. Thus, the $(k + 1)$-overlap condition holds. □

We call a graph $G$ that satisfies the $(k + 1)$-overlap condition a
$(k + 1)$-overlap graph. A $k$-robust graph must be a $(k + 1)$-overlap
graph but not vice versa. In Figure 1, subgraph $\{r_{11} - r_{18}\}$ is
not a 2-overlap graph, because there are two 2-connected v-unions,
$\{r_{11} - r_{15}\}$ and $\{r_{13}, r_{16} - r_{18}\}$, but they share only
one node; indeed, the subgraph is not 1-robust. On the other hand, graph
$G_2$ in Figure 2 satisfies the 3-overlap condition, as it contains four
3-connected v-unions (actually four v-cliques), $Q_1 - Q_4$, and each
v-union shares 3 nodes in total with the others; however, it is not
2-robust (removing $r_3$ and $r_4$ disconnects it). Accordingly, for
$(k + 1)$-overlap graphs we still need to check $k$-robustness by reduction
to a Max-flow Problem.

Now the problem is to find $(k + 1)$-overlap subgraphs. Let $G$ be
a graph where a $(k + 1)$-connected v-union overlaps with the rest
of the v-unions on no more than $k$ nodes. We split $G$ by removing
these overlapping nodes. For subgraph $\{r_{11} - r_{18}\}$ in Figure 1, we
remove $r_{13}$ and obtain two subgraphs $\{r_{11} - r_{12}, r_{14} - r_{15}\}$
and $\{r_{16} - r_{18}\}$ (recall from Example 4.5 that $r_{13}$ cannot belong
to any core). Note that the resulting subgraphs may not be
$(k + 1)$-overlap graphs (e.g., $\{r_{11} - r_{12}, r_{14} - r_{15}\}$ contains
two v-unions that share only one node), so we need to further screen them.

We now describe our screening algorithm, SCREEN (details in
Algorithm 1), which takes a graph $G$, represented by $\bar{C}$ and $L$,
as input, finds $(k + 1)$-connected v-unions in $G$ and meanwhile
decides if $G$ is a $(k + 1)$-overlap graph. If not, it splits $G$ into
subgraphs for further examination.

1. If $G$ contains a single node, output it as a core if the node
represents multiple records that belong only to one v-clique.

2. For each v-clique $C \in \bar{C}$, initialize a v-union. We denote
the set of v-unions by $\bar{Q}$, the v-union that $C$ belongs to by
$Q(C)$, and the overlapping nodes of $C$ and $C'$ by $B(C, C')$.

3. For each v-clique $C \in \bar{C}$, we merge v-unions as follows.

(a) For each record $r \in C$ that has not been considered, for
every pair of v-cliques $C_1$ and $C_2$ in $r$’s index entry, if they
belong to different v-unions, add $r$ to overlap $B(C_1, C_2)$.

(b) For each v-union $Q \ne Q(C)$ where there exist $C_1 \in Q$
and $C_2 \in Q(C)$ such that $|B(C_1, C_2)| \ge k + 1$, merge $Q$
and $Q(C)$.

At the end, $\bar{Q}$ contains all $(k + 1)$-connected v-unions.

4. For each v-union $Q \in \bar{Q}$, find its border nodes as $B(Q) =
\bigcup_{C \in Q, C' \notin Q} B(C, C')$. If $|B(Q)| \le k$, split the
subgraph it belongs to, denoted by $G(Q)$, into two subgraphs
$Q \setminus B(Q)$ and $G(Q) \setminus Q$.

5. Return the remaining subgraphs.
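
Steps 2 to 4 can be sketched with a union-find structure over v-cliques. The sketch below is a simplification under our own assumptions, not the paper's implementation: the index is a dictionary from node id to the set of v-clique ids, every node counts as one unit, v-cliques are merged on the clique-pair test of Step 3(b), and the function name `find_v_unions` is hypothetical. It returns each v-union together with its border nodes, leaving the actual split to the caller.

```python
from collections import defaultdict
from itertools import combinations


def find_v_unions(inverted, k):
    """Group v-cliques into (k+1)-connected v-unions and find border nodes.

    inverted: dict mapping node id -> set of v-clique ids (assumed form).
    Returns a dict mapping each v-union (frozenset of v-clique ids)
    to the set of nodes it shares with other v-unions.
    """
    overlap = defaultdict(set)  # (clique, clique) -> shared nodes
    cliques = set()
    for node, clique_ids in inverted.items():
        cliques.update(clique_ids)
        for c1, c2 in combinations(sorted(clique_ids), 2):
            overlap[(c1, c2)].add(node)

    parent = {c: c for c in cliques}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # Merge v-cliques that share at least k + 1 nodes.
    for (c1, c2), shared in overlap.items():
        if len(shared) >= k + 1:
            parent[find(c1)] = find(c2)

    members = defaultdict(set)
    for c in cliques:
        members[find(c)].add(c)

    # Border nodes: nodes shared by v-cliques of different v-unions.
    borders = defaultdict(set)
    for (c1, c2), shared in overlap.items():
        q1, q2 = find(c1), find(c2)
        if q1 != q2:
            borders[q1] |= shared
            borders[q2] |= shared

    return {frozenset(m): borders.get(root, set())
            for root, m in members.items()}


# Toy simplified index (hypothetical), with k = 1.
toy_index = {
    "r1/2": {"C1"}, "r3": {"C1", "C2"}, "r4": {"C1", "C2"},
    "r5": {"C1", "C2"}, "r6/7": {"C2"},
    "r11": {"C3"}, "r12": {"C3", "C4"}, "r13": {"C3", "C4", "C5"},
    "r14/15": {"C4"}, "r16/17/18": {"C5"},
}
v_unions = find_v_unions(toy_index, 1)
```

A v-union whose border set has at most $k$ nodes is a candidate for splitting as in Step 4. In the toy index, the v-union formed by `C3` and `C4` has the single border node `r13`, so with $k = 1$ it would be split off.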

PROPOSITION 4.11. Denote by $|L|$ the number of entries in
input $L$. Let $m$ be the maximum number of values from
dominant-value attributes of a record, and $a$ be the maximum number of
adjacent v-unions that a v-union has. Algorithm SCREEN finds
$(k + 1)$-overlap subgraphs in time $O((m^2 + a) \cdot |L|)$ and the result
is independent of the order in which we examine the v-cliques. □

PROOF. We first prove the time complexity of SCREEN. It takes
time $O(m^2 |L|)$ to scan all entries in $L$ and find common nodes
between each pair of adjacent v-cliques (Step 3(a)). It takes time
$O(a |\bar{C}|)$ to merge v-unions, where $|\bar{C}|$ is the number of
v-cliques in $G$ (Step 3(b)). Since $|\bar{C}| \le |L|$, the algorithm runs
in time $O((m^2 + a) \cdot |L|)$.

We next prove that the result of SCREEN is independent of the
order in which we examine the v-cliques, that is, 1) finding all
maximal $(k + 1)$-connected v-unions in $G$ is order independent;
2) removing all nodes in $B(Q)$ from $G$ where $|B(Q)| \le k$ is order
independent.

Consider order independence of finding all v-unions in $G$.
Finding all v-unions in $G$ is conceptually equivalent to finding all
connected components in an abstract graph $G_A$, where each node in
$G_A$ is a v-clique in $G$ and two nodes in $G_A$ are connected if the
two corresponding v-cliques share more than $k$ nodes. SCREEN
checks whether each node in $G$ is a common node between two
v-cliques (Step 3(a)), and if two cliques share more than $k$ nodes,
merges their v-unions (Step 3(b)), which is equivalent to connecting
two nodes in $G_A$. Once all nodes in $G$ are scanned, all edges in $G_A$
are added, and the structure of $G_A$ and its connected components do
not depend on the order in which we examine nodes in $G$.
Therefore, finding all v-unions in $G$ is order independent.

Consider order independence of removing nodes in $G$. Suppose
$Q_1, Q_2, \ldots, Q_m$, $m > 0$, are all v-unions in $G$ with
$|B(Q_i)| \le k$, $i \in [1, m]$. Since $G$ is finite, the set of such
$Q_i$ is finite and unique; thus, removing all nodes in $B(Q)$ from $G$
where $|B(Q)| \le k$ is order independent. □

Algorithm 1 SCREENING($G$, $\bar{C}$, $L$, $k$)

Input: $G$: Simplified similarity graph.
       $\bar{C}$: Set of $k$-cores.
       $L$: Inverted list of the similarity graph.
       $k$: Robustness requirement.
Output: $\bar{G}$: Set of subgraphs in $G$.

 1: if $G$ contains a single node $r$ then
 2:     if $r$ represents multiple records then
 3:         add $r$ to $\bar{C}$.
 4:     end if
 5:     return $\bar{G} = \emptyset$.
 6: else
 7:     initialize v-union $Q(C)$ for each v-clique $C$ and add $Q(C)$ to $\bar{Q}$.
 8:     // find v-unions
 9:     for each v-clique $C \in G$ do
10:         for each record $r \in C$ that has not been processed do
11:             for each v-clique pair $C_1, C_2 \in L(r)$ do
12:                 if $C_1, C_2$ are in different v-unions then
13:                     add $r$ to overlap $B(Q(C_1), Q(C_2))$.
14:                 end if
15:             end for
16:         end for
17:         for each v-union $Q$ where $|B(Q, Q(C))| > k$ do
18:             merge $Q$ and $Q(C)$ as $Q_m$.
19:             for each v-union $Q' \ne Q$, $Q' \ne Q(C)$ do
20:                 set $B(Q', Q_m) = B(Q', Q) \cup B(Q', Q(C))$.
21:             end for
22:         end for
23:     end for
24:     // screening
25:     for each v-union $Q \in \bar{Q}$ do
26:         compute $B(Q) = \bigcup_{Q' \in \bar{Q}} B(Q, Q')$.
27:         if $|B(Q)| \le k$ then
28:             add subgraphs $Q \setminus B(Q)$ and $G(Q) \setminus Q$ into $\bar{G}$.
29:         end if
30:     end for
31: end if
32: return $\bar{G}$;

Note that $m$ and $a$ are typically very small, so SCREEN is
basically linear in the size of the inverted index. Finally, we have
results similar to Theorem 4.8 for v-unions, so we can further
simplify the graph by keeping for each v-union a single representative
for all nodes that only belong to it. Each resulting $(k + 1)$-overlap
subgraph is typically very small.







Example 4.12. Consider Table [?] as input and $k = 1$. Step 2
creates five v-unions $Q_1 - Q_5$ for the five v-cliques in the input.

Step 3 starts with v-clique $C_1$. It has 4 nodes (in the simplified
inverted index), among which 3 are shared with $C_2$. Thus,
$B(C_1, C_2) = \{r_3 - r_5\}$ and $|B(C_1, C_2)| \ge 2$, so we merge $Q_1$
and $Q_2$ into $Q_{1/2}$. Examining $C_2$ reveals no other shared node.

Step 3 then considers v-clique $C_3$. It has three nodes, among
which $r_{12} - r_{13}$ are shared with $C_4$ and $r_{13}$ is also shared with
$C_5$. Thus, $B(C_3, C_4) = \{r_{12} - r_{13}\}$ and $B(C_3, C_5) = \{r_{13}\}$.
We merge $Q_3$ and $Q_4$ into $Q_{3/4}$. Examining $C_4$ and $C_5$ reveals
no other shared node. We thus obtain three 2-connected v-unions:
$\bar{Q} = \{Q_{1/2}, Q_{3/4}, Q_5\}$.

Step 4 then considers each v-union. For $Q_{1/2}$, $B(Q_{1/2}) = \emptyset$
and we thus split subgraph $Q_{1/2}$ out and merge all of its nodes to
one, $r_{1/\ldots/7}$. For $Q_{3/4}$, $B(Q_{3/4}) = \{r_{13}\}$, so
$|B(Q_{3/4})| < 2$. We split $Q_{3/4}$ out and obtain
$\{r_{11} - r_{12}, r_{14/15}\}$ ($r_{13}$ is excluded).
Similarly for $Q_5$, and we obtain $\{r_{16/17/18}\}$. Therefore, we return
three subgraphs for further screening. □

4.3.2 Reduction 

Intuitively, a graph $G(V, E)$ is $k$-robust if and only if between
any two nodes $a, b \in V$, there are more than $k$ paths that do not
share any node except $a$ and $b$. We denote the number of
non-overlapping paths between nodes $a$ and $b$ by $\kappa(a, b)$. We can
reduce the problem of computing $\kappa(a, b)$ into a Max-flow Problem.

For each input $G(V, E)$ and nodes $a, b$, we construct the
(directed) flow network $G'(V', E')$ as follows.

1. Node $a$ is the source and $b$ is the sink (there is no particular
order between $a$ and $b$).

2. For each $v \in V$, $v \ne a$, $v \ne b$, add two nodes $v', v''$ to
$V'$, and two directed edges $(v', v'')$, $(v'', v')$ to $E'$. If $v$
represents $n$ nodes, the edge $(v', v'')$ has weight $n$, and the
edge $(v'', v')$ has weight $\infty$.

3. For each edge $(a, v) \in E$, add edge $(a, v')$ to $E'$; for each
edge $(u, b) \in E$, add edge $(u'', b)$ to $E'$; for each other edge
$(u, v) \in E$, add two edges $(u'', v')$ and $(v'', u')$ to $E'$. Each
edge has capacity $\infty$.

Lemma 4.13. The max flow from source $a$ to sink $b$ in $G'(V', E')$
is equivalent to $\kappa(a, b)$ in $G(V, E)$. □

PROOF. According to Menger’s Theorem [3], the minimum number
of nodes whose removal disconnects $a$ and $b$, that is $\kappa(a, b)$, is
equal to the maximum number of independent paths between $a$ and
$b$. The authors of [5] prove that the maximum number of independent
paths between $a$ and $b$ in an undirected graph $G(V, E)$ is
equivalent to the maximal value of flow from $a$ to $b$, or the minimal
capacity of an $a$-$b$ cut (a set of nodes such that any path from $a$
to $b$ contains a member of the cut), in $G'(V', E')$. □
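
The reduction above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the graph is an undirected edge list, `weight` optionally gives the number of records a node represents (default 1), $a$ and $b$ are assumed non-adjacent, and the max flow is computed with breadth-first augmenting paths (the Edmonds-Karp variant of Ford and Fulkerson's method). The sketch omits the infinite-capacity reverse edge $(v'', v')$, which no simple $a$-$b$ path can use. All names (`kappa`, `toy_edges`) are hypothetical.

```python
from collections import defaultdict, deque

INF = float("inf")


def kappa(edges, a, b, weight=None):
    """Max flow from a to b in the node-split network of an undirected graph."""
    weight = weight or {}
    cap = defaultdict(lambda: defaultdict(int))

    def add(u, v, c):
        cap[u][v] += c
        cap[v][u] += 0  # make sure the residual edge exists

    def node_in(v):   # v' in the text
        return v if v in (a, b) else (v, "in")

    def node_out(v):  # v'' in the text
        return v if v in (a, b) else (v, "out")

    nodes = {x for edge in edges for x in edge}
    for v in nodes - {a, b}:
        add(node_in(v), node_out(v), weight.get(v, 1))
    for u, v in edges:
        if {u, v} == {a, b}:
            raise ValueError("this sketch assumes a and b are not adjacent")
        for x, y in ((u, v), (v, u)):
            if y != a and x != b:  # no edge into the source or out of the sink
                add(node_out(x), node_in(y), INF)

    flow = 0
    while True:
        prev = {a: None}
        queue = deque([a])
        while queue and b not in prev:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in prev:
                    prev[v] = u
                    queue.append(v)
        if b not in prev:
            return flow
        path, v = [], b
        while prev[v] is not None:
            path.append((prev[v], v))
            v = prev[v]
        pushed = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= pushed
            cap[v][u] += pushed
        flow += pushed


# Toy graph (hypothetical): every r1-r6 path passes through r3 or r4.
toy_edges = [
    ("r1", "r2"), ("r1", "r3"), ("r1", "r4"), ("r2", "r3"), ("r2", "r4"),
    ("r3", "r4"), ("r3", "r5"), ("r3", "r6"), ("r4", "r5"), ("r4", "r6"),
    ("r5", "r6"),
]
```

On the toy graph, `kappa(toy_edges, "r1", "r6")` evaluates to 2, since the two middle nodes form a minimum cut. Passing `weight={"r3": 3}` raises the value to 4, which reflects a merged node that stands for three records.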

Example 4.14. Consider nodes $r_1$ and $r_6$ of graph $G_2$ in
Figure 2. Figure 3 shows the corresponding flow network, where the
dashed line (across edges $(r_3', r_3'')$, $(r_4', r_4'')$) in the figure
cuts the flow from $r_1$ to $r_6$ with a minimum cost of 2. The max
flow/min cut has value 2. Indeed, $\kappa(r_1, r_6) = 2$. □

Recall that in a $(k + 1)$-connected v-union, between each pair of
nodes there are at least $k + 1$ paths. Thus, if (1) $\kappa(a, b) \ge k + 1$,
(2) $a$ and $b$ belong to different v-unions, and (3) $a$ and $a'$ belong to
the same v-union, we must have $\kappa(a', b) \ge k + 1$. We thus have
the following sufficient and necessary condition for $k$-robustness.

Figure 3: Flow network for $G_2$ in Figure 2.

Theorem 4.15 (Max-flow condition). Let $G(V, E)$ be
an input similarity graph. Graph $G$ is $k$-robust if and only if for
every pair of adjacent $(k + 1)$-connected v-unions $Q$ and $Q'$, there
exist two nodes $a \in Q \setminus Q'$ and $b \in Q' \setminus Q$ such that
the max flow from $a$ to $b$ in the corresponding flow network is at least
$k + 1$. □

PROOF. According to Menger’s Theorem [5], $\kappa(a, b)$ in $G$ is
equivalent to the max flow from $a$ to $b$ in the corresponding flow
network. We need to prove that graph $G$ is $k$-robust if and only
if for each pair of adjacent $(k + 1)$-connected v-unions $Q$ and $Q'$,
there exist two nodes $a \in Q \setminus Q'$ and $b \in Q' \setminus Q$ such
that $\kappa(a, b) \ge k + 1$.

We first prove that if $G$ is $k$-robust, for each pair of adjacent
$(k + 1)$-connected v-unions $Q$ and $Q'$, there exist two nodes
$a \in Q \setminus Q'$ and $b \in Q' \setminus Q$ such that
$\kappa(a, b) \ge k + 1$. Since $G$ is $k$-robust, for each pair of nodes
$a$ and $b$ in $G$, we have $\kappa(a, b) \ge k + 1$.

We next prove that if $G$ is not $k$-robust, there exists a pair of
adjacent $(k + 1)$-connected v-unions $Q$ and $Q'$ such that for each
pair of nodes $a \in Q \setminus Q'$ and $b \in Q' \setminus Q$, we have
$\kappa(a, b) < k + 1$. Since $G$ is not $k$-robust, there exists a
separator $S$, a set of nodes in $G$ with size no greater than $k$ whose
removal disconnects $G$ into two subgraphs $X$ and $Y$. Suppose $Q$ and
$Q'$ are two v-unions in $G$ such that $Q \subseteq X$, $Q' \subseteq Y$ and
$Q \cap Q' \ne \emptyset$. For each pair of nodes $a \in Q \setminus Q'$ and
$b \in Q' \setminus Q$, we have $a \in X$ and $b \in Y$, and removing the
set of nodes in $S$ disconnects $a$ and $b$; thus $\kappa(a, b) < k + 1$.

The above two cases prove that graph $G$ is $k$-robust if and only
if for every pair of adjacent $(k + 1)$-connected v-unions $Q$ and
$Q'$, there exist two nodes $a \in Q \setminus Q'$ and $b \in Q' \setminus Q$
such that $\kappa(a, b) \ge k + 1$, i.e., the max flow from $a$ to $b$ in the
corresponding flow network is at least $k + 1$. □

If a graph $G$ is not $k$-robust, we shall split it into subgraphs for
further processing. In the corresponding flow network, each edge
in the minimum cut must be between a pair of nodes derived from
the same node in $G$ (other edges have capacity $\infty$). These nodes
cannot belong to any core and we use them as separator nodes,
denoted by $S$. Suppose the separator separates $G$ into $X$ and $Y$
(there can be more subgraphs); we return $X \cup S$ and $Y \cup S$.

Note that we need to include $S$ in both subgraphs to maintain
the integrity of each v-union. To understand why, consider $G_2$ in
Figure 2, where $S = \{r_3, r_4\}$. According to the definition, there is
no 2-core. If we split $G_2$ into $\{r_1 - r_2\}$ and $\{r_5 - r_6\}$ (without
including $S$), both subgraphs are 2-robust and we would return them
as 2-cores. The problem happens because v-cliques $Q_1 - Q_4$
“disappear” after we remove the separators $r_3$ and $r_4$. Thus, we should
split $G_2$ into $\{r_1 - r_4\}$ and $\{r_3 - r_6\}$ instead, and that would
further trigger splitting on both subgraphs. Eventually we wish to exclude
the separator nodes from any core, so we mark them as “separators”
and exclude them from the returned cores.

Algorithm SPLIT (details in Algorithm 2) takes a $(k + 1)$-overlap
subgraph $G$ as input and decides if $G$ is $k$-robust. If not, it splits $G$
into subgraphs on which we will then re-apply screening.

1. For each pair of adjacent $(k + 1)$-connected v-unions $Q, Q' \in
G$, find $a \in Q \setminus Q'$, $b \in Q' \setminus Q$. Construct flow
network $G'(V', E')$ and apply the Ford & Fulkerson Algorithm [13] to
compute the max flow.

2. Once we find nodes $a, b$ where $\kappa(a, b) \le k$, use the min cut
of the flow network as separator $S$. Remove $S$ and obtain
several subgraphs. Add $S$ back to each subgraph and mark
$S$ as “separator”. Return the subgraphs for screening.

3. Otherwise, $G$ is $k$-robust and output it as a $k$-core.

Algorithm 2 SPLIT($G$, $\bar{C}$, $k$)

Input: $G$: Simplified similarity graph.
       $\bar{C}$: Set of cores.
       $k$: Robustness requirement.
Output: $\bar{G}$: Set of subgraphs in $G$.

 1: for each pair of adjacent $(k + 1)$-connected v-unions $Q, Q'$ do
 2:     find a pair of nodes $a \in Q \setminus Q'$, $b \in Q' \setminus Q$.
 3:     construct flow network $G'$ and compute $\kappa(a, b)$ by the Ford &
        Fulkerson Algorithm.
 4:     if $\kappa(a, b) \le k$ then
 5:         get separator $S$ from $G'$ and remove $S$ from $G$ to obtain
            disconnected subgraphs; mark $S$ as “separator” and add it
            to each subgraph in $\bar{G}$.
 6:         return the set $\bar{G}$ of subgraphs.
 7:     end if
 8: end for
 9: if $\bar{G} = \emptyset$ then
10:     add $G$ to $\bar{C}$.
11: end if
12: return $\bar{G}$;

Example 4.16. Continue with Example 4.14 and $k = 2$. There
are four 3-connected v-unions. When we check $r_1 \in Q_1$ and
$r_6 \in Q_3$, we find $S = \{r_3, r_4\}$. We then split $G_2$ into subgraphs
$\{r_1 - r_4\}$ and $\{r_3 - r_6\}$, marking $r_3$ and $r_4$ as “separators”.

Now consider graph $G_1$ in Figure 2 and $k = 2$. There are four
3-connected v-unions (actually four v-cliques) and six pairs of
adjacent v-unions. For $Q_1$ and $Q_2$, we check nodes $r_2$ and $r_4$ and find
$\kappa(r_2, r_4) = 3$. Similarly we check for every other pair of adjacent
v-unions and decide that the graph is 2-robust. □

PROPOSITION 4.17. Let $p$ be the total number of pairs of
adjacent v-unions, and $g$ be the number of nodes in the input graph.
Algorithm SPLIT runs in time $O(pg^{2.5})$. □

PROOF. The authors of [9] prove that it takes time $O(g^{2.5})$ to
compute $\kappa(a, b)$ for a pair of nodes $a$ and $b$ in $G$. In the worst case
SPLIT needs to compute $\kappa(a, b)$ for $p$ pairs of adjacent v-unions.
Thus, SPLIT runs in time $O(pg^{2.5})$. □

Recall that if we solve the Max-flow Problem directly for each
pair of nodes in the original graph, the complexity is $O(|L|^{4.5})$,
which would be dramatically higher.

4.3.3 Full algorithm 

We are now ready to present the full algorithm, CORE
(Algorithm 3). Initially, it initializes the working queue $\mathbf{Q}$ with
only input $G$ (Line 1). Each time it pops a subgraph $G'$ from
$\mathbf{Q}$ and invokes SCREEN (Lines 3-4). If the output of SCREEN is still
$G'$ (so $G'$ is a $(k + 1)$-overlap subgraph) (Line 5), it removes any node
with mark “separator” in $G'$ and puts the new subgraph into the
working queue (Line 7), or invokes SPLIT on $G'$ if there is no
separator (Line 9). Subgraphs output by SCREEN and SPLIT are added
to the queue for further examination (Lines 10, 13) and identified
cores are added to $\bar{C}$, the core set. It terminates when
$\mathbf{Q} = \emptyset$.


Algorithm 3 CORE($G$, $k$)

Input: $G$: Simplified similarity graph, represented by $\bar{C}$ and $L$.
       $k$: Robustness requirement.
Output: $\bar{C}$: Set of cores in $G$.

 1: Let $\mathbf{Q} = \{G\}$, $\bar{C} = \emptyset$;
 2: while $\mathbf{Q} \ne \emptyset$ do
 3:     Pop $G'$ from $\mathbf{Q}$;
 4:     Let $P$ = SCREEN($G'$, $k$, $\bar{C}$);
 5:     if $P = \{G'\}$ then
 6:         if $G'$ contains “separator” nodes then
 7:             Remove separators from $G'$ and add the result to $\mathbf{Q}$ if
                it is not empty;
 8:         else
 9:             Let $S$ = SPLIT($G'$, $k$, $\bar{C}$);
10:             add graphs in $S$ to $\mathbf{Q}$;
11:         end if
12:     else
13:         add graphs in $P$ to $\mathbf{Q}$;
14:     end if
15: end while
16: return $\bar{C}$;


The correctness of algorithm CORE is guaranteed by the following
lemmas.

Lemma 4.18. For each pair of adjacent nodes $r, r'$ in graph
$G$, there exists a maximal $k$-robust partitioning such that $r, r'$ are
in the same subgraph. □

PROOF. For each pair of adjacent nodes $r, r'$ in $G$, we prove the
existence of such a maximal $k$-robust partitioning by constructing
it.

By definition, adjacent nodes $r, r'$ form a v-clique $C$. Therefore,
there exists a maximal v-clique $C'$ in $G$ that contains $r, r'$, i.e.,
$C \subseteq C'$. V-clique $C'$ can be obtained by repeatedly adding nodes
in $G$ to $C$ so that each newly added node is adjacent to each node in the
current clique, until no nodes in $G$ can be added to $C'$. By definition,
any v-clique is $k$-robust; therefore there exists a maximal $k$-robust
subgraph $G'$ in $G$ such that $C' \subseteq G'$. Graph $G'$ can be obtained
by repeatedly adding nodes in $G$ to $C'$ so that each newly added node is
adjacent to at least $k + 1$ nodes in the current graph $G'$, until no nodes
in $G$ can be added to $G'$. We remove $G'$ from $G$ and take $G'$ as a
subgraph in the desired partitioning.

We repeat the above process on a randomly selected pair of
adjacent nodes in the remaining graph $G \setminus G'$ until it is empty. The
desired partitioning satisfies Conditions 1 and 2 of Definition 4.2
because the above process makes sure each subgraph is exclusive
and $k$-robust; it satisfies Condition 3 of Definition 4.2 because the
above process makes sure each subgraph is maximal, which means
merging an arbitrary number of subgraphs in the partitioning would
violate Condition 2.

In summary, the desired partitioning is a maximal $k$-robust
partitioning. It proves that for each pair of adjacent nodes $r$ and $r'$ in
graph $G$, there exists a maximal $k$-robust partitioning such that $r$
and $r'$ are in the same subgraph. □

Lemma 4.19. The set of nodes in a separator $S$ of graph $G$
does not belong to any $k$-core in $G$, where $|S| \le k$. □

PROOF. Suppose the set $S$ of nodes separates $G$
into $m$ disconnected sets $X_i$, $i \in [1, m]$, $m > 0$. To prove that
each node $r \in S$ does not belong to any $k$-core in $G$, we prove
that for a node $r' \in G$, $r' \ne r$, there exists a maximal $k$-robust
partitioning such that $r$ and $r'$ are separated. Node $r'$ falls into one
of the following cases: 1) $r' \in X_i$, $i \in [1, m]$; 2) $r' \in S$.

Consider Case 1), where $r' \in X_i$, $i \in [1, m]$. We construct a
maximal $k$-robust partitioning of $G$ where $r$ and $r'$ are in different
subgraphs. We start with a maximal $k$-robust subgraph $G'$ in $G$
that contains $r$ and $r''$, where $r''$ is adjacent to $r$ and in $X_j$,
$j \ne i$, $j \in [1, m]$, and find other maximal $k$-robust subgraphs as in
Lemma 4.18. Since $S$ separates $X_i$ and $X_j$, the maximal $k$-robust
subgraph $G'$ that contains $r$ and $r''$ does not contain any node in
$X_i$. It proves that there exists a maximal $k$-robust partitioning of $G$
where $r$ and $r'$ are not in the same subgraph.

Consider Case 2), where $r' \in S$. We construct a maximal
$k$-robust partitioning of $G$ such that $r$ and $r'$ are in different
subgraphs. We create two maximal $k$-robust subgraphs $G'$ and $G''$,
where $G'$ contains $r$ and an adjacent node $r_i \in X_i$, $i \in [1, m]$, and
$G''$ contains $r'$ and an adjacent node $r_j \in X_j$, $j \ne i$,
$j \in [1, m]$. We create other subgraphs as in Lemma 4.18. Since each path
between $r_i \in X_i$ and $r_j \in X_j$ contains at least one node in $S$ and
$|S| \le k$, graph $G' \cup G''$ is not $k$-robust. Therefore, the created
partitioning is a maximal $k$-robust partitioning. It proves that there
exists a maximal $k$-robust partitioning of $G$ where $r$ and $r'$ are not in
the same subgraph.

Given the above two cases, any node in separator $S$
of $G$ does not belong to any $k$-core in $G$, where $|S| \le k$. □


Theorem 4.20. Let $G$ be the input graph and $q$ be the number
of $(k + 1)$-connected v-unions in $G$. Define $a$, $p$, $g$, $m$, and $|L|$ as
in Propositions 4.11 and 4.17. Algorithm CORE finds correct $k$-cores
of $G$ in time $O(q((m^2 + a)|L| + pg^{2.5}))$ and is order independent.

PROOF. We first prove that CORE correctly finds $k$-cores in $G$,
that is, 1) nodes not returned by CORE do not belong to any $k$-core;
2) each subgraph returned by CORE forms a $k$-core.

We prove that nodes not returned by CORE do not belong to any
$k$-core in $G$. Nodes not returned by CORE belong to separators of
subgraphs in $G$. Suppose $S$ is a separator of graph $G_n \in \mathbf{Q}$
found in either the SCREEN or the SPLIT phase, where $G_n \subseteq G$,
$n \ge 0$, $G_0 = G$, and $S$ separates $G_n$ into $m$ subgraphs $X_i$,
$i \in [1, m]$, $m > 1$. Graph $G_n^i \in \mathbf{Q}$ is a subgraph of $G_n$
such that any node $r \in X_j$, $j \in [1, m]$, $j \ne i$, does not belong to
$G_n^i$. Nodes removed in $G_n^i$ by CORE belong to separator $S$ in $G_n$.
Given Lemma 4.19, such nodes do not belong to any $k$-core in $G_n$ and thus
do not belong to any $k$-core in $G$.

We next prove that each subgraph returned by CORE forms a
$k$-core in $G$. We prove two cases: 1) subgraph $G'$ in $G$ forms a
$k$-core if there exists a separator $S$ that disconnects $G'$ from $G$, where
$|S| \le k$ and $G' \cup S$ and $G'$ are both $k$-robust; 2) if a subgraph is
a $k$-core in $G_n^i$, it is a $k$-core in graph $G_n$.

We consider Case 1), that subgraph $G'$ in $G$ forms a $k$-core if
there exists a separator $S$ that disconnects $G'$ from $G$, where
$|S| \le k$ and $G' \cup S$ and $G'$ are both $k$-robust. For a pair of nodes
$r_1, r_2$ in $G'$, we prove that there exists no maximal $k$-robust
partitioning where $r_1$ and $r_2$ are in different subgraphs. Suppose such a
partitioning exists, and $G_1, G_2$ are the subgraphs containing $r_1, r_2$
respectively. Since $G_1, G_2 \subseteq G' \cup S$, we have that
$G_1 \cup G_2$ is $k$-robust, which violates the fact that the result of
merging any two subgraphs in a maximal $k$-robust partitioning is not
$k$-robust. Therefore, there exists no maximal $k$-robust partitioning where
$r_1$ and $r_2$ are in different subgraphs. It proves that $G'$ is a $k$-core
in $G$.

We next consider Case 2), that if a subgraph $G'$ is a $k$-core in
$G_n^i$, it is a $k$-core in graph $G_n$. We prove that a pair of nodes
$r_1, r_2 \in G'$ belong to the same subgraph of every maximal $k$-robust
partitioning in $G_n$. Suppose, to the contrary, there exists a partitioning
of $G_n$ where $r_1 \in G_1$, $r_2 \in G_2$, and $G_1 \ne G_2$. Since
$G_n^i \subseteq X_i \cup S$, we have $G_1, G_2 \subseteq$

Table 4: Step-by-step core identification in Example 4.21.

Input        Method   Output
$G_2$        Screen   $G_2$
$G_2$        Split    $G_2^1 = \{r_1 - r_4\}$, $G_2^2 = \{r_3 - r_6\}$
$G_2^1$      Screen   $\{r_3\}$, $\{r_4\}$
$G_2^2$      Screen   $\{r_3\}$, $\{r_4\}$
$\{r_3\}$    Screen   -
$\{r_4\}$    Screen   -
$G$          Screen   $G^1 = \{r_{1/\ldots/7}\}$, $G^2 = \{r_{11}, r_{12}, r_{14/15}\}$, $G^3 = \{r_{16/17/18}\}$
$G^1$        Screen   Core $\{r_1 - r_7\}$
$G^2$        Screen   $G^4 = \{r_{11}\}$, $G^5 = \{r_{14/15}\}$
$G^3$        Screen   Core $\{r_{16} - r_{18}\}$
$G^4$        Screen   -
$G^5$        Screen   Core $\{r_{14} - r_{15}\}$

$G_n$, otherwise $G_1, G_2$ are not $k$-robust. Since $r_1, r_2$ belong
to the same $k$-core in $G_n^i$, we have $G_1 = G_2$. This proves that
if $G'$ is a $k$-core in $G_n^i$, it is a $k$-core in $G_n$.

The above two cases prove that each subgraph returned by Core
forms a $k$-core in $G$. In summary, nodes not returned by Core do
not belong to any $k$-core, and each subgraph returned by Core
forms a $k$-core in $G$. Thus, Core correctly finds all $k$-cores in $G$.

This further proves that the result of Core is independent of the
order in which we find and remove separators of graphs in $Q$.

We now analyze the time complexity of Core. For each
$(k+1)$-connected v-union in $G$, it takes time $O((m^2 + a)|L|)$ to
carry out the Screen phase and time $O(p g^{2.5})$ to carry out the
Split phase. In total there are $q$ v-unions in $G$, so the algorithm
takes time $O(q((m^2 + a)|L| + p g^{2.5}))$. □

Example 4.21. First, consider graph $G_2$ in Figure [?] and $k = 2$.
Table 4 shows the step-by-step core identification process. The graph
passes screening and is the input for Split. Split then splits it into
two subgraphs, where $r_3$ and $r_4$ are marked as "separators". Screen
further splits each of them into $\{r_3\}$ and $\{r_4\}$, both discarded
as each represents a single node (and is a separator). So Core does
not output any core.

Next, consider the motivating example, with the input shown in
Table [?] and $k = 1$. Originally, $Q = \{G\}$. After invoking Screen
on $G$, we obtain three subgraphs $G^1$, $G^2$, and $G^3$. Screen
outputs $G^1$ and $G^3$ as 1-cores since each contains a single node
that represents multiple records. It further splits $G^2$ into two
single-node graphs $G^4$ and $G^5$, and outputs the latter as a 1-core.
Note that if we removed the 1-robustness requirement, we would merge
$r_{11} - r_{18}$ into the same core and get false positives. □

Case study: On the data set with 18M records, our core-identification
algorithm finished in 2.2 minutes. Screen was invoked 114K
times and took 2 minutes (91%) in total. Apart from the original graph,
each input contains at most 39.3K nodes; 97% of the inputs have
fewer than 10 nodes, for which running Screen was very fast. Split
was invoked only 26 times; its inputs contain at most 65 nodes (13
v-unions) and on average 7.8 nodes (2.7 v-unions). Recall that the
simplified inverted index contains 1.5M entries, so Screen reduced the
size of the input to Split by 4 orders of magnitude.

5. GROUP LINKAGE 

The second stage clusters the cores and the remaining records,
which we call satellites, into groups. To avoid merging records
based only on weak evidence, we require that no cluster contains
more than one satellite without also containing a core. Compared
with clustering in traditional record linkage, our algorithm differs
in three aspects. First, in addition to weighting each attribute, we
weight the values according to their popularity within a group, such
that similarity on primary values (strong evidence) is rewarded more.
Second, we treat all values for dominant-value attributes as a whole,
so that we are tolerant to differences in local values from different
entities in the same group. Third, we distinguish weights for distinct
values and non-distinct values, such that similarity on distinct values
is rewarded more. This section first describes the objective function
for clustering (Section 5.1) and then proposes a greedy algorithm for
clustering (Section 5.2).

5.1 Objective function 

SV-index: Ideally, we wish that each cluster is cohesive (each
element, being a core or a satellite, is close to other elements in the
same cluster) and different clusters are distinct (each element is
fairly different from those in other clusters). Since records in the
same group may have fairly different local values, we adopt the
Silhouette Validation Index (SV-index) [25] as the objective function,
as it is more tolerant to diversity within a cluster. Given a
clustering $C$ of elements $E$, the SV-index of $C$ is defined as follows.


\[ S(C) = \mathrm{Avg}_{e \in E}\, S(e); \tag{1} \]

\[ S(e) = \frac{a(e) - b(e) + \alpha}{\max\{a(e),\, b(e)\} + \beta}. \tag{2} \]


Here, $a(e) \in [0, 1]$ denotes the similarity between element $e$ and
its own cluster, $b(e) \in [0, 1]$ denotes the maximum similarity
between $e$ and another cluster, and $\beta > \alpha > 0$ are small
numbers that keep $S(e)$ finite and non-zero (we discuss in Section 6
how we set the parameters). A nice property of $S(e)$ is that it falls
in $[-1, 1]$, where a value close to 1 indicates that $e$ is in an
appropriate cluster, a value close to $-1$ indicates that $e$ is
mis-classified, and a value close to 0 while $a(e)$ is not too small
indicates that $e$ is equally similar to two clusters that should
possibly be merged. Accordingly, we wish to obtain a clustering with
the maximum SV-index. We next describe how we compare an element with
a cluster.
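
As a minimal sketch of Eqs. (1)-(2), the following Python computes $S(e)$ and $S(C)$ from precomputed similarities. The function names and the dictionary layout are illustrative assumptions, not part of the paper; $\alpha = .01$ and $\beta = .02$ are the values reported in Section 6, and the $(a(e), b(e))$ pairs are read from Table 5(a).

```python
from statistics import mean

def silhouette(a, b, alpha=0.01, beta=0.02):
    """S(e) from Eq. (2): a = similarity to own cluster, b = best other cluster."""
    return (a - b + alpha) / (max(a, b) + beta)

def sv_index(pairs):
    """S(C) from Eq. (1): the average of S(e) over all elements."""
    return mean(silhouette(a, b) for a, b in pairs.values())

# (a(e), b(e)) for clustering C_a, read from Table 5(a).
table_5a = {
    "Cr2": (0.9, 0.5), "Cr3": (1.0, 0.6), "r11": (1.0, 0.7),
    "r12": (0.99, 0.95), "r13": (1.0, 0.95),
    "r19": (1.0, 0.5), "r20": (1.0, 0.5),
}
sv_a = sv_index(table_5a)
```

Rounded to two digits, `sv_a` agrees with the SV-index of .32 that Example 5.4 reports for $C_a$.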

Similarity computation: We consider an element $e$ similar to a
cluster $Cl$ if they have highly similar values on common-value
attributes (e.g., name) and share at least one primary value (we
explain "primary" later) on dominant-value attributes (e.g., phone,
URL); in addition, our confidence is higher if they also share values
on multi-value attributes (e.g., category). Following previous
work on handling multi-value attributes [7, 21], we compute the
similarity $sim(e, Cl)$ as follows.

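\[ sim(e, Cl) = \min\{1,\ sim_s(e, Cl) + \tau\, w_m\, sim_{multi}(e, Cl)\}; \tag{3} \]

\[ sim_s(e, Cl) = \frac{w_c\, sim_{com}(e, Cl) + w_d\, sim_{dom}(e, Cl)}{w_c + w_d}; \tag{4} \]

\[ \tau = \begin{cases} 0 & \text{if } sim_s(e, Cl) < \theta_{th}, \\ 1 & \text{otherwise.} \end{cases} \tag{5} \]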


Here, $sim_{com}$, $sim_{dom}$, and $sim_{multi}$ denote the similarity
for common-, dominant-, and multi-value attributes respectively. We
take the weighted sum of $sim_{com}$ and $sim_{dom}$ as a strong
indicator of $e$ belonging to $Cl$ (measured by $sim_s(e, Cl)$), and
reward the weak indicator $sim_{multi}$ only if $sim_s(e, Cl)$ is above
a pre-defined threshold $\theta_{th}$; the similarity is at most 1.
Weights $0 < w_c, w_d, w_m < 1$ indicate how much we reward value
similarity or penalize value difference; we learn the weights from
sampled data. We next highlight how we leverage strong evidence from
cores while remaining tolerant to other different values in similarity
computation.
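
The combination rule of Eqs. (3)-(5) can be sketched as follows. The three component similarities are taken as given inputs, and the weights in the usage lines are hypothetical placeholders (the paper learns them from sampled data); only $\theta_{th} = .6$ is taken from Section 6.

```python
def sim_strong(sim_com, sim_dom, w_c, w_d):
    """Eq. (4): weighted average of common- and dominant-value similarity."""
    return (w_c * sim_com + w_d * sim_dom) / (w_c + w_d)

def sim(sim_com, sim_dom, sim_multi, w_c, w_d, w_m, theta_th):
    """Eq. (3): reward multi-value similarity only when the strong
    evidence reaches theta_th (the switch of Eq. (5)); cap at 1."""
    s = sim_strong(sim_com, sim_dom, w_c, w_d)
    tau = 1 if s >= theta_th else 0
    return min(1.0, s + tau * w_m * sim_multi)

# Hypothetical weights, for illustration only.
W = dict(w_c=0.5, w_d=0.5, w_m=0.1, theta_th=0.6)
strong_case = sim(0.9, 0.7, 1.0, **W)  # sim_s = 0.8, reward applied
weak_case = sim(0.9, 0.1, 1.0, **W)    # sim_s = 0.5, reward withheld
capped_case = sim(1.0, 1.0, 1.0, **W)  # 1.0 + 0.1 is capped at 1
```

With these placeholder weights, the multi-value reward lifts the first case from .8 to .9, while the second case stays at .5 because its strong evidence is below the threshold.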

First, we identify primary values (strong evidence) as popular
values within a cluster. When we maintain the signature for a core
or a cluster, we keep all values of an attribute and assign a high
weight to a popular value. Specifically, let $R$ be a set of records.
Consider value $v$ and let $R(v) \subseteq R$ denote the records in $R$
that contain $v$. The weight of $v$ is computed by
$w(v) = \frac{|R(v)|}{|R|}$.

Example 5.1. Consider phone for core $Cr_1 = \{r_1 - r_7\}$
in Table 2. There are 7 business listings in $Cr_1$, 5 providing 808
($r_1 - r_5$), one providing 101 ($r_6$), and one providing 102 ($r_7$).
Thus, the weight of 808 is $\frac{5}{7} = .71$ and the weight for 101
and 102 is $\frac{1}{7} = .14$, showing that 808 is the primary phone
for $Cr_1$. □
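
The popularity weight can be sketched in a few lines of Python. The representation of a record as a set of values is an assumption made for illustration; the counts reproduce Example 5.1.

```python
from collections import Counter

def value_weights(records):
    """w(v) = |R(v)| / |R| for every value v appearing in the records.

    `records` is a list of value sets, one per record (an assumed layout).
    """
    counts = Counter(v for values in records for v in set(values))
    return {v: c / len(records) for v, c in counts.items()}

# Phone values of core Cr_1 from Example 5.1: five listings with 808,
# one with 101, one with 102.
phones = [{"808"}] * 5 + [{"101"}, {"102"}]
weights = value_weights(phones)
primary = max(weights, key=weights.get)
```

Here `primary` is the value with the largest weight, which is how the example singles out 808 as the primary phone.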

Second, when we compute $sim_{dom}(e, Cl)$, we consider all the
dominant-value attributes together, rewarding shared primary values
(values with a high weight) but not penalizing different values
unless there is no shared value. Specifically, if the primary value
of an element is the same as that of a cluster, we consider them
to have probability $p$ of being in the same group. Since we use
weights to measure whether a value is primary and allow slight
differences between values, with a value $v$ from $e$ and $v'$ from
$Cl$, the probability becomes
$p \cdot w_e(v) \cdot w_{Cl}(v') \cdot s(v, v')$, where $w_e(v)$
measures the weight of $v$ in $e$, $w_{Cl}(v')$ measures the weight of
$v'$ in $Cl$, and $s(v, v')$ measures the similarity between $v$ and
$v'$. We compute $sim_{dom}(e, Cl)$ as the probability that they belong
to the same group given several shared values as follows.

\[ sim_{dom}(e, Cl) = 1 - \prod_{v \in e,\, v' \in Cl} \bigl(1 - p \cdot w_e(v) \cdot w_{Cl}(v') \cdot s(v, v')\bigr). \tag{6} \]

When there is no shared primary value, $sim_{dom}$ can be close to
0; once there is one such value, $sim_{dom}$ can be significantly
increased, since we typically set a large $p$.

Example 5.2. Consider element $e = r_8$ and cluster
$Cl_1 = \{r_1 - r_7\}$ in Example 1.1. Assume $p = .9$. Element $e$ and
$Cl_1$ share the primary email domain, with weight 1 and
$\frac{5}{7} = .71$ respectively, but have different phone numbers
(assuming similarity of 0). We compute
$sim_{dom}(e, Cl_1) = 1 - (1 - .9 \cdot 1 \cdot .71 \cdot 1) \cdot (1 - 0) \cdot (1 - 0) \cdot (1 - 0) = .639$;
essentially, we do not penalize the difference in phone numbers. Note,
however, that if homedepot appeared only once and so was not a primary
value, its weight would be .14 and accordingly
$sim_{dom}(e, Cl_1) = .126$, indicating a much lower similarity. □
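
Eq. (6) is a noisy-or over all value pairs, which can be sketched as below. The exact-match value similarity and the placeholder phone labels are assumptions for illustration; the weights and $p = .9$ follow Example 5.2.

```python
def sim_dom(elem_values, cluster_values, p,
            s=lambda v, w: 1.0 if v == w else 0.0):
    """Eq. (6): noisy-or over all pairs (v from e, v' from Cl).

    elem_values and cluster_values map each value to its popularity
    weight. `s` is the value similarity; exact match is a stand-in.
    """
    miss = 1.0
    for v, w_e in elem_values.items():
        for v2, w_cl in cluster_values.items():
            miss *= 1.0 - p * w_e * w_cl * s(v, v2)
    return 1.0 - miss

# Example 5.2 with p = .9; the phone labels are placeholders.
e = {"homedepot": 1.0, "phone_e": 1.0}
cl = {"homedepot": 0.71, "phone_a": 0.71, "phone_b": 0.14}
shared_primary = sim_dom(e, cl, p=0.9)

# Same element, but the shared value is unpopular in the cluster.
cl_rare = {"homedepot": 0.14, "phone_a": 0.71}
shared_rare = sim_dom(e, cl_rare, p=0.9)
```

Pairs with similarity 0 contribute a factor of 1, so differing phone numbers leave the result unchanged, matching the remark in the example that the difference is not penalized.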

Third, when we learn weights, we learn one set of weights for
distinct values (appearing in only one cluster) and one set for
non-distinct values, such that distinct values, which can be considered
as stronger evidence, typically contribute more to the final
similarity. In Example 1.1, sharing "Home Depot, The" would serve as
stronger evidence than sharing Taco Casa for group similarity.

5.2 Clustering algorithm 

In most cases, clustering is intractable [14, 26]. We maximize
the SV-index in a greedy fashion. Our algorithm starts with an
initial clustering and then iteratively examines whether we can improve
the current clustering (increase the SV-index) by merging clusters or
moving elements between clusters. According to the definition of the
SV-index, in both initialization and adjusting, we always assign an
element to the cluster with which it has the highest similarity.

Initialization: Initially, we (1) assign each core to its own cluster
and (2) assign a satellite $r$ to the cluster with the highest
similarity if the similarity is above threshold $\theta_{ini}$, and
create a new cluster for $r$ otherwise. We update the signature of each
core along the way. Note that initialization is sensitive to the order
in which we consider the records. Although designing an algorithm
independent of the ordering is possible, such an algorithm is more
expensive, and our experiments show that the iterative adjusting can
smooth out the difference.
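
A minimal sketch of this greedy initialization follows. The similarity function is passed in as a parameter, and `toy_sim`, the record tuples, and the threshold value are hypothetical. Restricting merge targets to clusters that contain a core is an assumption of this sketch, made so that the validity requirement (no cluster with several satellites and no core) holds by construction.

```python
def initialize(cores, satellites, sim, theta_ini):
    """Greedy initialization: one cluster per core, then place each
    satellite in the order given.

    `sim(element, cluster)` is any element-cluster similarity in [0, 1].
    Returns a list of clusters, each a list of elements.
    """
    clusters = [[core] for core in cores]
    n_core_clusters = len(clusters)
    for r in satellites:
        # Assumption: only clusters that contain a core are merge targets.
        targets = clusters[:n_core_clusters]
        best = max(targets, key=lambda cl: sim(r, cl), default=None)
        if best is not None and sim(r, best) > theta_ini:
            best.append(r)        # the cluster signature now includes r
        else:
            clusters.append([r])  # r starts its own cluster
    return clusters

# Toy similarity (illustration only): fraction of cluster members that
# share the element's phone. Elements are (id, phone) tuples.
def toy_sim(r, cluster):
    return sum(1 for m in cluster if m[1] == r[1]) / len(cluster)

cores = [("Cr2", "900"), ("Cr3", "901")]
satellites = [("r12", "900"), ("r11", "777"), ("r13", "900")]
result = initialize(cores, satellites, toy_sim, theta_ini=0.5)
```

Because satellites are placed one at a time against the current cluster contents, a different input order can yield a different initial clustering, which is the order sensitivity noted above.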







Figure 4: Clustering of $r_{11} - r_{20}$ in Table 2 (clusterings $C_a$, $C_b$, $C_c$, $C_d$).

Table 5: Element-cluster similarity and SV-index for clusterings
in Figure 4. Similarity between an element and its own
cluster is in bold and the second-to-highest similarity is in italic.
Low S(e) scores are in italic.

          Cl_2    Cl_3    Cl_4    Cl_5    Cl_6    S(e)
Cr_2      .9      .5      .5      .5      .5      .44
Cr_3      .6      1       .5      .5      .5      .4
r_11      .7      .5      1       .5      .5      .3
r_12      .99     .5      .95     .5      .5      .05
r_13      1       .9      .95     .5      .5      .05
r_19      .5      .5      .5      1       .5      .5
r_20      .5      .5      .5      .5      1       .5

(a) Cluster C_a

          Cl_2    Cl_3    Cl_5    Cl_6    S(e)
Cr_2      .87     .5      .5      .5      .43
Cr_3      .58     1       .5      .5      .42
r_11      .79     .5      .5      .5      .37
r_12      .96     .5      .5      .5      .48
r_13      .97     .9      .5      .5      .07
r_19      .5      .5      1       .5      .5
r_20      .5      .5      .5      1       .5

(b) Cluster C_b


Example 5.3. Continue with the motivating example in Table 2.
First, consider records $r_1 - r_{10}$, where $Cr_1 = \{r_1 - r_7\}$ is
a core. We first create a cluster $Cl_1$ for $Cr_1$. We then merge
records $r_8 - r_{10}$ into $Cl_1$ one by one, as they share similar
names and either the primary phone number or the primary URL.

Now consider records $r_{11} - r_{20}$; recall that there are 2 cores
and 5 satellites after core identification. Figure 4 shows the
initialization result $C_a$. Initially we create two clusters $Cl_2$,
$Cl_3$ for cores $Cr_2$, $Cr_3$. Records $r_{11}$, $r_{19} - r_{20}$ do
not share any primary value on dominant-value attributes with $Cl_2$ or
$Cl_3$, so have a low similarity with them; we create a new cluster for
each of them. Records $r_{12}$ and $r_{13}$ share the primary phone with
$Cr_2$ so have a high similarity; we link them to $Cl_2$. □

Cluster adjusting: Although we always assign an element $e$ to the
cluster with the highest similarity, so $S(e) > 0$, the resulting
clustering may still be improved by merging some clusters or moving a
subset of elements from one cluster to another. Recall that when $S(e)$
is close to 0 and $a(e)$ is not too small, it indicates that a pair of
clusters might be similar and is a candidate for merging. Thus, in
cluster adjusting, we find such candidate pairs, iteratively adjust
them by merging them or moving a subset of elements between them, and
choose the new clustering if it increases the SV-index.

We first describe how we find candidate pairs. Consider element
$e$ and assume it is closest to clusters $Cl$ and $Cl'$. If
$S(e) < \theta_s$, where $\theta_s$ is a threshold for considering
merging, we call $e$ a border element of $Cl$ and $Cl'$ and consider
$(Cl, Cl')$ as a candidate pair. We rank the candidates according to
(1) how many border elements they have and (2) for each border element
$e$, how close $S(e)$ is to 0. Accordingly, we define the benefit of
merging $Cl$ and $Cl'$ as

\[ b(Cl, Cl') = \sum_{e \text{ is a border of } Cl \text{ and } Cl'} (1 - S(e)), \]

and rank the candidate pairs in decreasing order of the benefit.
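
Finding and ranking candidate pairs can be sketched as follows. The input layout (own cluster, second-closest cluster, $S(e)$ per element) is an assumption; the $S(e)$ values are those of clustering $C_a$ in Table 5(a), with ties for the second-closest cluster broken arbitrarily. The threshold is passed as 0.31 rather than .3 only so that the rounded value $S(r_{11}) = .3$ counts as a border element under the strict inequality.

```python
from collections import defaultdict

def merge_candidates(elements, theta_s):
    """Rank candidate cluster pairs by b(Cl, Cl'), the sum of
    (1 - S(e)) over their border elements.

    `elements` maps an element id to
    (own cluster, second-closest cluster, S(e)).
    """
    benefit = defaultdict(float)
    for own, other, s in elements.values():
        if s < theta_s:  # e is a border element of (own, other)
            benefit[frozenset((own, other))] += 1.0 - s
    return sorted(benefit.items(), key=lambda kv: kv[1], reverse=True)

# S(e) values of clustering C_a from Table 5(a).
c_a = {
    "Cr2": ("Cl2", "Cl3", 0.44), "Cr3": ("Cl3", "Cl2", 0.40),
    "r11": ("Cl4", "Cl2", 0.30), "r12": ("Cl2", "Cl4", 0.05),
    "r13": ("Cl2", "Cl4", 0.05), "r19": ("Cl5", "Cl2", 0.50),
    "r20": ("Cl6", "Cl2", 0.50),
}
ranked = merge_candidates(c_a, theta_s=0.31)
```

The single resulting pair $(Cl_2, Cl_4)$ has benefit $.7 + .95 + .95 = 2.6$, the value used in Example 5.4.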

We next describe how we re-cluster elements in a candidate pair
$(Cl, Cl')$. We adjust by merging the two clusters, or moving the
border elements between the clusters, or moving out the border
elements and merging them. Figure 5 shows the four re-clustering
plans for a candidate pair. Among them, we consider those that
are valid (i.e., no cluster contains more than one satellite without
a core) and choose the one with the highest SV-index. When we
compute the SV-index, we consider only elements in $Cl$, $Cl'$ and those
that are second-to-closest to $Cl$ or $Cl'$ (their $a(e)$ or $b(e)$ can
change), which reduces the computation cost. After the adjusting, we
need to re-compute $S(e)$ for these elements and update the
candidate-pair list accordingly.

Figure 5: Re-clustering plans for $Cl_1$ and $Cl_2$.
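
The validity filter on re-clustering plans can be sketched as below. Representing a plan as a list of element sets is an assumption; the two plans are the ones considered for the candidate pair $(Cl_2, Cl_4)$ in Example 5.4.

```python
def is_valid(cluster, cores):
    """A cluster is invalid if it holds more than one satellite and no core."""
    n_cores = sum(1 for e in cluster if e in cores)
    n_satellites = len(cluster) - n_cores
    return n_cores > 0 or n_satellites <= 1

def valid_plans(plans, cores):
    """Keep only re-clustering plans in which every cluster is valid."""
    return [plan for plan in plans if all(is_valid(cl, cores) for cl in plan)]

cores = {"Cr2", "Cr3"}
merged = [{"r11", "r12", "r13", "Cr2"}]    # merge the two clusters
split = [{"r11", "r12", "r13"}, {"Cr2"}]   # move border elements out
kept = valid_plans([merged, split], cores)
```

Only the merged plan survives, since the other one would leave three satellites in a cluster without a core.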

Example 5.4. Consider adjusting clustering $C_a$ in Figure 4.
Table 5(a) shows the similarity of each element-cluster pair and the
SV-index of each element. Thus, the SV-index is .32.

Suppose $\theta_s = .3$. Then, $r_{11} - r_{13}$ are border elements of
$Cl_2$ and $Cl_4$, where $b(Cl_2, Cl_4) = .7 + .95 + .95 = 2.6$ (there
is a single candidate so we do not need to compare the benefit). For the
candidate, we have two re-clustering plans,
$\{\{r_{11} - r_{13}, Cr_2\}\}$ and $\{\{r_{11} - r_{13}\}, \{Cr_2\}\}$,
of which the latter is invalid. For the former ($C_b$ in Figure 4), we
need to update $S(e)$ for every element and the new SV-index is .4
(Table 5(b)), higher than the original one. □

The full clustering algorithm Cluster (details in Algorithm 4)
goes as follows.

1. Initialize a clustering $C$ and a list Que of candidate pairs
ranked in decreasing order of merging benefit (Lines 1-2).

2. For each candidate pair $(Cl, Cl')$ in Que do the following.

(a) Examine each valid adjusting plan, compute the SV-index
for it, and choose the one with the highest SV-index (Line 4).

(b) Change the clustering if the new plan has a higher SV-index
than the original clustering. Recompute $S(e)$ for each
relevant element $e$ and move $e$ to a new cluster if appropriate.
Update Que accordingly (Lines 6-16).

3. Repeat Step 2 until Que $= \emptyset$.


PROPOSITION 5.5. Let $l$ be the number of distinct candidate
pairs ever in Que and $|E|$ be the number of input elements.
Algorithm Cluster takes time $O(l \cdot |E|^2)$. □

PROOF. It takes time $O(|E|^2)$ to initialize clustering $C$ and list
Que. It takes time $O(|E|^2)$ to check each distinct candidate pair in
Que, where it takes $O(|E|)$ to examine all valid clustering plans and
select the one with the highest SV-index (Step 2(a)), and it takes
$O(|E|^2)$ to recompute the SV-index for all relevant elements and
update Que (Step 2(b)). In total there are $l$ distinct candidate pairs
ever in Que, thus Cluster takes time $O(l \cdot |E|^2)$. □

Note that we first block records according to name similarity and
take each block as an input, so typically $|E|$ is quite small. Also, in
practice we need to consider only a few candidate pairs for adjusting
in each input, so $l$ is also small.






















Algorithm 4 Cluster($E$, $\theta_s$)

Input: $E$: A set of cores and satellites for clustering.
       $\theta_s$: Pre-defined threshold for considering merging.
Output: $C$: A clustering of elements in $E$.

 1: Initialize $C$ according to $E$;
 2: Compute $S(C)$ and generate a list Que of candidate pairs;
 3: for each candidate pair $(Cl, Cl') \in$ Que do
 4:    compute SV-index for its valid re-clustering plans and
       choose the clustering $C_{max}$ with the highest SV-index;
 5:    if $S(C) < S(C_{max})$ then
 6:       let $C = C_{max}$, change = true;
 7:       while change do
 8:          change = false;
 9:          for each relevant element $e$ do
10:             recompute $S(e)$;
11:             when appropriate, move $e$ to a new cluster and set
                change = true;
12:             if $S(e) < \theta_s$ in the previous or current $C$ then
13:                update the merging benefit of the related candidate
                   pair and add it to Que or remove it from Que
                   when appropriate;
14:             end if
15:          end for
16:       end while
17:    end if
18: end for
19: return $C$;


Table 6: Statistics of the experimental data sets.

          #Records   #Groups      Group size   #Singletons
                     (size > 1)                (size = 1)
Random    2062       30           [2, 308]     503
AI        2446       1            2446         0
UB        322        9            [2, 275]     5
FBIns     1149       14           [33, 269]    0
SIGMOD    590        71           [2, 41]      162


Example 5.6. Continue with Example 5.4 and consider adjusting
$C_b$. Now there is one candidate pair $(Cl_2, Cl_3)$, with border
element $r_{13}$. We consider clusterings $C_c$ and $C_d$. Since
$S(C_c) = .37 < .40$ and $S(C_d) = .32 < .40$, we keep $C_b$ and return
it as the result. We do not merge records
$Cl_2 = \{r_{11} - r_{15}\}$ with $Cl_3 = \{r_{16} - r_{18}\}$,
because they share neither phone nor the primary URL. Cluster
returns the correct chains. □

6. EXPERIMENTAL EVALUATION 

This section describes experimental results on two real-world
data sets, showing the high scalability of our techniques and the
advantages of our algorithm over rule-based or traditional
machine-learning methods in accuracy.

6.1 Experiment settings 

Data and gold standard: We experimented on two real-world data
sets. Biz contains 18M US business listings and each listing has
attributes name, phone, URL, location and category; we decide
which listings belong to the same business chain. SIGMOD contains
records about 590 attendees of SIGMOD'98 and each record
has attributes name, affiliation, address, phone, fax and email;
we decide which attendees belong to the same institute.

We experimented on the whole Biz data set to study scalability
of our techniques. We evaluated accuracy of our techniques on five
subsets of data. The first four are from Biz. (1) Random contains
2062 listings from Biz, where 1559 belong to 30 randomly selected
business chains, and 503 do not belong to any chain; among the 503
listings, 86 are highly similar in name to listings in the business
chains and the rest are randomly selected. (2) AI contains 2446
listings for the same business chain, Allstate Insurance. These listings
have the same name, but 1499 provide URL "allstate.com", 854
provide another URL "allstateagencies.com", 130 provide
both, and 227 listings do not provide any value for phone or URL.
(3) UB contains 322 listings with exactly the same name Union
Bank and highly similar category values; 317 of them belong to 9
different chains while 5 do not belong to any chain. (4) FBIns
contains 1149 listings with similar names and highly similar
category values; they belong to 14 different chains. Among the
listings, 708 provide the same wrong name Texas Farm Bureau
Insurance and meanwhile provide a wrong URL
farmbureauinsurance-mi.com. Among these four subsets, the latter three
are hard cases; for each data set, we manually verified all the chains
by checking store locations provided by the business-chain websites and
used the result as the gold standard. The last "subset" is actually the
whole SIGMOD data set. It has very few wrong values, but the same
affiliation can be represented in various ways and some affiliation
names can be very similar (e.g., UCSC vs. UCSD). We manually identified
71 institutes that have multiple attendees, and there are 162 attendees
who do not belong to these institutes. Table 6 shows statistics of
the five subsets.


Measure: We considered each group as a cluster and compared
pairwise linking decisions with the gold standard. We measured the
quality of the results by precision ($P$), recall ($R$), and F-measure
($F$). If we denote the set of true-positive pairs by $TP$, the set of
false-positive pairs by $FP$, and the set of false-negative pairs by
$FN$, then

\[ P = \frac{|TP|}{|TP| + |FP|}, \qquad R = \frac{|TP|}{|TP| + |FN|}, \qquad F = \frac{2PR}{P + R}. \]

In addition, we reported execution time.

Implementation: We implemented the technique we proposed in
this paper, and call it Group. In core generation, for Biz we
considered two records similar if (1) their name similarity is above
.95; and (2) they share at least one phone or URL domain name.
For SIGMOD we required that (1) affiliation similarity is above .95;
and (2) they share at least one of phone prefix (3-digit), fax prefix
(3-digit), or email server, or the addresses have a similarity above
.9. We required 2-robustness for cores. In clustering, (1) for
blocking, we put records whose name similarity is above .8 in the same
block; (2) for similarity computation, we computed string similarity by
Jaro-Winkler distance [15], we set $\alpha = .01$, $\beta = .02$,
$\theta_{th} = .6$, $p = .8$, and we learned other weights from 1000
records randomly selected from Random data for Biz, and 300 records
randomly selected from SIGMOD. We discuss later the effect of these
choices.

For comparison, we also implemented the following baselines:

• SameName groups Biz records with highly similar names
and groups SIGMOD records with highly similar affiliations
(similarity above .95);

• ConnectedGraph generates the similarity graph in the same way as
Group but considers each connected subgraph as a group;

• One-stage machine-learning linkage methods include Partition,
Center and Merge [16]; each method computes
record similarity by Eq. (3) with learned weights.

• Two-stage method Yoshida [30] generates cores by agglomerative
clustering with threshold .9 in the first stage, and in the second
stage uses TF/IDF weights for features and applies linear algebra to
assign each record to a group.
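
To make the ConnectedGraph baseline concrete, here is a minimal sketch that builds edges with the Biz rule from the implementation paragraph (name similarity above .95 plus a shared phone or URL domain) and returns connected components via union-find. The record layout and the sample records are hypothetical, `difflib.SequenceMatcher` is only a stand-in for the Jaro-Winkler measure, and the all-pairs loop is for illustration and ignores blocking.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar(a, b, name_threshold=0.95):
    """Edge test: similar names and a shared phone or URL domain.
    SequenceMatcher is a stand-in for the Jaro-Winkler measure."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    shared = (a["phones"] & b["phones"]) or (a["urls"] & b["urls"])
    return name_sim > name_threshold and bool(shared)

def connected_groups(records):
    """Each connected component of the similarity graph is one group."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(records, 2):
        if similar(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical records, for illustration only.
records = {
    "r1": {"name": "Home Depot, The", "phones": {"808"}, "urls": {"homedepot"}},
    "r2": {"name": "Home Depot, The", "phones": {"101"}, "urls": {"homedepot"}},
    "r3": {"name": "Home Depot, The", "phones": {"101"}, "urls": set()},
    "r4": {"name": "Taco Casa", "phones": {"900"}, "urls": {"tacocasa"}},
}
groups = connected_groups(records)
```

Note that r1 and r3 share no value directly and end up together only through r2; a single erroneous value can chain unrelated records in the same way, which is the kind of false positive the robustness requirement on cores is meant to avoid.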


We implemented the algorithms in Java. We used a Linux
machine with Intel Xeon X5550 processor (2.66GHz, cache 8MB,
6.4GT/s QPI). We used MySQL to store the data sets and stored
the index as a database table. Note that after blocking, we can fit
each block of nodes or elements into main memory, which is
typically the case with a good blocking strategy.

Figure 6: Overall results on Biz data set.

Figure 7: Results on SIGMOD data. (a) Overall results (SameName, ConnectedGraph, Merge, Yoshida, Group); (b) Contribution of different components (Merge, Core, Cluster, Group).

6.2 Evaluating effectiveness 

We first evaluate effectiveness of our algorithms. Figure 6 and
Figure 7(a) compare Group with the baseline methods, where for
the three one-stage linkage methods we plot only the best results.
On FBIns, all methods put all records in the same chain because a
large number (708) of listings have both a wrong name and a wrong
URL. We manually perturbed the data as follows: (1) among the
708 listings with wrong URLs, 408 provide a single (wrong) URL
and we fixed it; (2) for all records we set name to "Farm Bureau
Insurance", thereby removing hints from business names. Even after
perturbing, this data set remains the hardest and we use it hereafter
instead of the original one for other experiments.

We have the following observations. (1) Group obtains the
highest F-measure (above .9) on each data set. It has the highest
precision most of the time as it applies core identification and
leverages the strong evidence collected from the resulting cores. It
also has a very high recall (mostly above .95) on each subset because
the clustering phase is tolerant to diversity of values within chains.
(2) The F-measure of SameName is 7-80% lower than that of Group. It can
have false positives when listings with highly similar names belong
to different chains and can also have false negatives when some
listings in a chain have fairly different names from other listings.
It performs well only on AI, where it happens that all listings have
the same name and belong to the same chain. (3) The F-measure
of ConnectedGraph is 2-39.4% lower than that of SameName. It
requires in addition sharing at least one value for dominant-value
attributes. As a result, it has a lower recall than SameName; it has
fewer false positives than SameName, but because it has fewer
true positives, its precision can appear to be lower too. (4) The
highest F-measure of the one-stage linkage methods is 1-94.7% higher
than that of ConnectedGraph. As they require high record similarity,
they have a similar number of false positives to ConnectedGraph but
often have many more true positives; thus, they often have a higher
recall and also a higher precision. However, their highest F-measure
is still 1-38.7% lower than that of Group. (5) Yoshida has comparable
precision to Group since its first stage is conservative too, which
often makes it improve over the best of the one-stage linkage methods
on the Biz data sets, where reducing false positives is a big
challenge; on the other hand, its first stage is often too conservative
(requiring high record similarity), so its recall is 10-34.6% lower
than that of Group, which also makes it perform worse than the
one-stage linkage methods on the SIGMOD data set, where reducing false
negatives is challenging.

Figure 8: Contribution of different components on Biz.

Contribution of different components: We compared Group
with (1) Core, which applies Algorithm CoreIdentification
but does not apply clustering, and (2) Cluster, which considers
each individual record as a core and applies Algorithm Cluster
(in the spirit of [20, 28]). Figure 8 and Figure 7(b) show the results.
First, we observe that Core improves over the one-stage linkage
methods on precision by .1-78.6% but has a lower recall (1.5-34.3%
lower) most of the time, because it sets a high requirement for
merging records into groups. Note however that its goal is indeed
to obtain a high precision such that the strong evidence collected
from the cores is trustworthy for the clustering phase. Second,
Cluster often has higher precision (by 1.6-77.3%) but lower recall
(by 2.5-32.2%) than the best one-stage linkage methods; their
F-measures are comparable on each data set. On some data sets
(Random, FBIns) it can obtain an even higher precision than Core,
because Core can make mistakes when too many records have erroneous
values, but Cluster may avoid some of these mistakes by
considering also similarity on state and category. However,
applying clustering on the results of Cluster would not change the
results, but applying clustering on the results of Core can obtain
a much higher F-measure, especially a higher recall (98% higher
than Cluster on Random). This is because the result of Cluster
lacks the strong evidence collected from high-quality cores, so
the final results would be less tolerant to diversity of values,
showing the importance of core identification. Finally, we observe that
Group obtains the best results on most of the data sets.

We next evaluate various choices in the two stages. Unless
specified otherwise, we observed similar patterns on each data set from
Biz and SIGMOD, and report the results on Random or perturbed
FBIns data, whichever has more distinguishable results.

6.2.1 Core identification

Figure 9: Core identification on perturbed FBIns data.

Figure 10: Effect of graph generation on Random data.

Figure 11: Effect of robustness requirement on Random data.



Figure 12: Clustering strategies on Random data. 


Core identification: We first compared three core-generation
strategies: Core iteratively invokes Screen and Split, OnlyScreen
only iteratively invokes Screen, and YoshidaI generates cores
by agglomerative clustering [30]. Recall that by default we
apply Core. Figure 9 compares them on the perturbed FBIns data.
First, we observe similar results for OnlyScreen and Core on
all data sets, since most inputs to Split pass the $k$-robustness test.
Thus, although Screen in itself cannot guarantee soundness of the
resulting cores ($k$-robustness), it already does well in practice.
Second, YoshidaI has lower recall in both core and clustering results,
since it has stricter criteria in core generation.

Graph generation: We compared three edge-adding strategies for
similarity graphs: Sim takes weighted similarity on each attribute
except location and requires a similarity of over .8; TwoDom
requires sharing name and at least two values on dominant-value
attributes; OneDom requires sharing name and one value on
dominant-value attributes. Recall that by default we applied OneDom.
Figure 10 compares these three strategies. We observe that (1) Sim
requires similar records so has a high precision, with a big
sacrifice on recall for the cores (0.00025); as a result, the F-measure
of the chains is very low (.59); (2) TwoDom has the highest
requirements and so an even lower recall than Sim for the cores
(.00002), and in turn it has the lowest F-measure for the chains (.52).
This shows that requiring high precision for cores at the cost of a big
sacrifice on recall can also lead to a low F-measure for the chains.

We also varied the similarity requirement for names and
observed very similar results (varying by .04%) when we varied the
threshold from .8 to .95.

Robustness requirement: We next studied how the robustness
requirement can affect the results (Figure 11). We have three
observations. (1) When $k = 0$, we essentially take every connected
subgraph as a core, so the generated cores can have a much lower
precision; those false positives cause both a low precision and a
low recall for the resulting chains because we do not collect
high-quality strong evidence. (2) When we vary $k$ from 1 to 4, the
number of false positives decreases while that of false negatives
increases for the cores, and the F-measure of the chains increases
but only very slightly. (3) When we continue increasing $k$, the
results of cores and clusters remain stable. This is because setting
$k = 4$ already splits the graph into subgraphs, each containing a
single v-clique, so further increasing $k$ would not change the cores.
This shows that considering $k$-robustness is important, but $k$ does
not need to be too high.

6.2.2 Clustering 

Clustering strategy: We first compared our clustering algorithm
with two algorithms proposed for the second stage of two-stage
clustering: LiuII [22] iteratively applies majority voting to assign
each record to a cluster and collects a set of representative features
for each cluster using a threshold (we set it to 5, which leads to
the best results); YoshidaII [30] is the second stage of Yoshida.
Figure 12(a) compares their results. We observe that our clustering
method improves the recall by 39% over LiuII and by 11%
over YoshidaII. LiuII may filter out strong evidence by the
threshold; YoshidaII cannot handle well records whose dominant-value
attributes have null values.

We also compared four clustering algorithms: GreedyInitial
performs only initialization as we described in Section 5;
ExhaustiveInitial also performs only initialization, but by iteratively
conducting matching and merging until no record can be merged to
any core; ClusterWGreedy applies cluster adjusting on the
results of GreedyInitial, and ClusterWExhaustive applies
cluster adjusting on the results of ExhaustiveInitial. Recall
that by default we apply ClusterWGreedy. Figure 12(b) compares
their results. We observe that (1) applying cluster adjusting
can improve the F-measure a lot (by 8.6%), and (2) exhaustive
initialization does not significantly improve over greedy initialization,
if at all. This shows the effectiveness of the current algorithm
Cluster.

Value weight: We then compared the results with and without setting
popularity weights for values; Figure 13 shows the comparison
on perturbed FBIns data. We observe that setting the popularity
weight helps distinguish primary values from unpopular values, and
thus can improve the precision. Indeed, on perturbed FBIns data it
improves the precision from .11 to .98, and improves the F-measure
by 403%.

Attribute weight: We next considered our weight learning strategy.
We first compared SeparatedDominant, which learns separate
weights for different dominant-value attributes, and UnitedDominant
(our default), which considers all such attributes as a
whole and learns one single weight for them. Figure 14 shows that
on Random the latter improves over the former by 95.4% on recall
and obtains slightly higher precision, because it penalizes only if
neither phone nor URL is shared and so is more tolerant to different
values for dominant-value attributes. This shows the importance of
being tolerant to value variety on dominant-value attributes.

Figure 13: Value weights on perturbed FBIns data (NoPopWeight vs.
PopWeight; F-measure, precision, recall).

Figure 14: Dominant-value attributes on Random (SeparatedDominant
vs. UnitedDominant; F-measure, precision, recall).

Figure 15: Distinct values on Random data.

Figure 16: Attribute weights on Random data.

Next, we compared SingleWeight, which learns a single weight
for each attribute, and DoubleWeight (our default), which learns
different weights for distinct values and non-distinct values for each
attribute. Figure 15 shows that DoubleWeight significantly improves
the recall (by 94% on Random) since it rewards sharing of
distinct values, and so can link some satellite records with null
values on dominant-value attributes to the chains they should belong
to. This shows the importance of distinguishing distinct and
non-distinct values.

We also compared three weight-setting strategies: (1) 3Equal
considers common-value attributes, dominant-value attributes, and
multi-value attributes, and sets the same weight for each of them;
(2) 2Equal sets an equal weight of .5 for common-value attributes
and dominant-value attributes, and a weight of .1 for each
multi-value attribute; (3) Learned applies weights learned from labeled
data. Recall that by default we applied Learned. Figure 16 compares
their results. We observe that (1) 2Equal obtains a higher
F-measure than 3Equal (.64 vs. .54), since it distinguishes between
strong and weak indicators for record similarity; (2) Learned
significantly outperforms the other two strategies (by 50% over
2Equal and by 76% over 3Equal), showing the importance and
effectiveness of weight learning.
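
The relative improvements quoted above can be sanity-checked with
simple arithmetic. The sketch below assumes that "by x% over" denotes a
relative gain in F-measure; the helper name is hypothetical, and the
F-measure of Learned that it derives is implied by the reported
numbers rather than stated in the text.

```python
def implied_score(baseline, relative_gain_pct):
    """Score implied by a baseline and a relative gain given in percent."""
    return baseline * (1 + relative_gain_pct / 100.0)

# F-measures reported above for the two fixed-weight strategies.
f_2equal, f_3equal = 0.64, 0.54

from_2equal = implied_score(f_2equal, 50)  # Learned is 50% over 2Equal
from_3equal = implied_score(f_3equal, 76)  # Learned is 76% over 3Equal

print(round(from_2equal, 2), round(from_3equal, 2))
```

Both estimates land at roughly .95 to .96, so the two reported
percentages are mutually consistent up to rounding.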

Attribute contributions: We then considered the contribution of
each attribute for chain classification. Figure 17 shows the results
on the perturbed FBIns data and we have four observations. (1)
Considering only name but not any other attribute obtains a high
recall but a very low precision, since all listings on this data set
have the same name. (2) Considering dominant-value attributes in
addition to name can improve the precision significantly and
improve the F-measure by 104%. (3) Considering category in addition
does not further improve the results, while considering state in
addition even drops the precision significantly, since three chains
in this data set contain the same wrong value on state. (4)
Considering both category and state improves the recall by 46% and
obtains the highest F-measure.

Figure 17: Attribute contribution on perturbed FBIns.

Robustness w.r.t. parameters: We also ran experiments to test
robustness against parameter setting. We observed very similar
results when we ranged p from .8 to 1 and θ_th from .5 to .7.

6.3 Evaluating efficiency 

Our algorithm finished in 8.3 hours on the whole Biz data set with 
18M listings; this is reasonable given that it is an offline process and 
we used a single machine. Note that simpler methods (we describe 
shortly) took over 10 hours even for the first stage on fragments of 
the Biz data set. Also note that using the Hadoop infrastructure can 
reduce execution time for graph construction from 1.9 hours to 37 
minutes; we skip the details as it is not the focus of the paper. 

Stage I: Our algorithm spent 1.9 hours for graph construction and
2.2 minutes for core generation. To test scalability and understand
the importance of our choices for core generation, we randomly divided
the whole data set into five subsets of the same size; we started
with one subset and gradually added more. We compared five core
generation methods: Naive applies Split on the original graph; Index
optimizes Naive by using an inverted index; SIndex simplifies the
inverted list by Theorem 4.8; Union in addition merges v-cliques
into v-unions by Theorem 4.9; Core (Algorithm 1) in addition
splits the input graph by Theorem 4.10. Figure 18(a) shows the
results and we have five observations. (1) Naive was very slow.
Even though it applies Split rather than finding the max flow for
every pair of nodes, and so already optimizes by Theorem 4.15, it took
6.8 hours on only 20% of the data and more than 10 hours on 40% of
the data. (2) Index improved Naive by two orders of magnitude just
because the index simplifies finding neighborhood v-cliques; however,
it still took more than 10 hours on 80% of the data. (3) SIndex
improved Index by 41% on 60% of the data as it reduces the size of the
inverted index by 64%. (4) Union improved SIndex by 47% on
60% of the data; however, it also took more than 10 hours on 80% of
the data. (5) Core improved Union significantly; it finished in 2.2
minutes on the whole data set and so further reduced execution time by
at least three orders of magnitude, showing the importance of
splitting. Finally, for graph construction, Figure 18(b) shows the
linear growth of the execution time.
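
As an illustration of why an inverted index helps, the following is a
minimal sketch of looking up neighboring cliques. It assumes that the
inverted index maps each record to the cliques containing it and that
two cliques are neighbors when they share at least one record; the
function names and toy data are hypothetical and do not reproduce the
Index or SIndex implementations.

```python
from collections import defaultdict

def build_inverted_index(cliques):
    """Map each record id to the ids of the cliques that contain it."""
    index = defaultdict(set)
    for cid, members in cliques.items():
        for record in members:
            index[record].add(cid)
    return index

def neighbor_cliques(cid, cliques, index):
    """Return the cliques sharing at least one record with clique `cid`."""
    found = set()
    for record in cliques[cid]:
        found |= index[record]  # one lookup per member, no pairwise scan
    found.discard(cid)
    return found

# Hypothetical toy input: C1 and C2 overlap on r3; C3 is isolated.
cliques = {"C1": {"r1", "r2", "r3"}, "C2": {"r3", "r4"}, "C3": {"r5", "r6"}}
index = build_inverted_index(cliques)
```

Without the index, finding the neighbors of one clique would require
comparing it against every other clique; with it, the cost depends
only on the clique's own members and their index entries.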

Stage II: After core identification we have .7M cores and 17.3M
satellites. Stage II spent 6.4 hours in total: 1.7 hours for blocking
and 4.7 hours for clustering. The long time for clustering is because
of the huge number of blocks. There are 1.4M blocks with multiple
elements (a core is counted as one element), with a maximum size
of 22.5K and an average of 4.2. On only 35 blocks clustering took
more than 1 minute and the maximum is 2.5 minutes, but for 99.6% of
the blocks the size is less than 100 and Cluster took less than 60 ms.
The average time spent on each block is only 9.6 ms.

6.4 Summary and discussions 

Summary: We summarize our observations as follows. 

1. Identifying cores and leveraging evidence learned from the 
cores is crucial in group linkage. 

Figure 18: Execution time (we plot only those below 10 hours).

2. There are often erroneous values in real data and it is
important to be robust against them; applying OneDom and
requiring k ∈ [1,5] already performs well on most data sets
that have a reasonable number of errors.

3. Distinguishing the weights for distinct and non-distinct values,
and setting weights of values according to their popularity,
are critical for obtaining good clustering results.

4. Our algorithm is robust on reasonable parameter settings. 

5. Our algorithm is efficient and scalable. 

Discussion: In the paper, we present single-machine algorithms to
identify groups. Performing such data-intensive tasks on powerful
distributed hardware and service infrastructures has become popular,
in particular with the emergence of the widely available MapReduce
programming model [24][4][12]. We next discuss possible parallelized
solutions of our algorithms on the Hadoop infrastructure.

For graph construction, we can proceed in two steps: (1) create
all cliques, where nodes sharing the same value on the common-value
attribute and on a particular dominant-value attribute are in the
same clique, and (2) find all maximal cliques. In Step (1), we first
distribute records and map a record r to one or more <key, value>
pairs, where key is a value on a particular dominant-value attribute
of r and value is the value for the common-value attribute of r
(Mapper). We then find cliques in each block with a particular key,
and meanwhile keep an inverted list for each block (Reducer). Step (2)
takes the inverted lists and cliques output by Step (1) as input. It
first uses each entry in the inverted lists as a <key, value> pair to
map cliques, so that all cliques that a record r belongs to are mapped
into the same block (Mapper). We then find all maximal cliques
within each block (Reducer).
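
Step (1) can be sketched as a single-process simulation of the map,
shuffle, and reduce phases. This is only an illustration under stated
assumptions: the record fields (id, name, phone, url), the choice of
phone and url as dominant-value attributes, and the inclusion of the
record id in the emitted value are all hypothetical, and the code does
not use the Hadoop API. Step (2), finding maximal cliques, is not
covered.

```python
from collections import defaultdict

DOMINANT = ("phone", "url")  # assumed dominant-value attributes

def mapper(record):
    """Emit one (key, value) pair per non-null dominant value."""
    for attr in DOMINANT:
        if record.get(attr):
            # key: dominant value; value: common value plus the record id
            yield (attr, record[attr]), (record["name"], record["id"])

def reducer(key, values):
    """Within one block, records with the same name form a clique."""
    by_name = defaultdict(set)
    for name, rid in values:
        by_name[name].add(rid)
    return [frozenset(ids) for ids in by_name.values() if len(ids) > 1]

def run(records):
    blocks = defaultdict(list)  # shuffle phase: group values by key
    for rec in records:
        for key, value in mapper(rec):
            blocks[key].append(value)
    cliques, inverted = set(), defaultdict(set)
    for key, values in blocks.items():
        for clique in reducer(key, values):
            cliques.add(clique)
            for rid in clique:
                inverted[rid].add(clique)  # inverted list: record -> cliques
    return cliques, inverted

# Hypothetical toy records.
records = [
    {"id": 1, "name": "Acme", "phone": "555-0100", "url": "acme.example"},
    {"id": 2, "name": "Acme", "phone": "555-0100", "url": None},
    {"id": 3, "name": "Acme", "phone": None, "url": "acme.example"},
    {"id": 4, "name": "Zed", "phone": "555-0100", "url": None},
]
cliques, inverted = run(records)
```

Here record 1 ends up in two cliques (one per shared dominant value),
which is exactly the information the inverted list passes on to
Step (2).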

To detect cores in the similarity graphs, the algorithm proceeds
iteratively. We can use Spark [31], a cluster computing framework
that supports iterative jobs while retaining the scalability and fault
tolerance of MapReduce. For each iteration, we first partition the
input graphs into blocks so that each block contains all records of the
same maximal connected component (Mapper), and then run Core
within each block in parallel (Reducer). Note that the MapReduce
solution may not outperform our single-machine solution, which takes
only 2.2 minutes, because of the additional overhead of the
MapReduce program.
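
The partitioning step can be sketched as follows. This is a
single-machine illustration using union-find, not Spark or MapReduce
code; the function name and the toy graph are hypothetical, and the
per-block call to Core is left out because its details are given
elsewhere in the paper.

```python
from collections import defaultdict

def connected_blocks(nodes, edges):
    """Partition nodes into blocks, one per connected component."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merge the two components

    blocks = defaultdict(set)
    for n in nodes:
        blocks[find(n)].add(n)
    # Sort for a deterministic order; each block could then be
    # processed independently (for example, by a separate worker).
    return sorted(blocks.values(), key=lambda b: min(b))

# Hypothetical toy similarity graph with two components.
nodes = ["r1", "r2", "r3", "r4", "r5"]
edges = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
blocks = connected_blocks(nodes, edges)
```

Because no edge crosses two blocks, core detection in one block cannot
affect another, which is what makes the per-block parallelism safe.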

In a similar way, we identify groups as follows. We first partition
the input elements (satellites and cores) into blocks so that each
block contains elements that may potentially belong to the same
group (Mapper), and then run Cluster within each block in
parallel (Reducer).

7. CONCLUSIONS 

In this paper we studied how to link records to identify groups.
We proposed a two-stage algorithm that is shown to be empirically
scalable and accurate over two real-world data sets. Future work
includes studying the best way to combine record linkage and group
linkage, extending our techniques for finding overlapping groups,
and applying the two-stage framework in other contexts where
tolerance to value diversity is critical.